Strategy Improvement for 
Concurrent Reachability and Safety Games

Krishnendu Chatterjee†  Luca de Alfaro§  Thomas A. Henzinger†‡

† IST Austria (Institute of Science and Technology Austria)
§ CE, University of California, Santa Cruz, USA
‡ Computer and Communication Sciences, EPFL, Switzerland

{krish.chat, tah}@ist.ac.at, luca@soe.ucsc.edu



Abstract 

We consider concurrent games played on graphs. At every round of a game, each player simultaneously
and independently selects a move; the moves jointly determine the transition to a successor state.
Two basic objectives are the safety objective to stay forever in a given set of states, and its dual,
the reachability objective to reach a given set of states. First, we present a simple proof of the fact
that in concurrent reachability games, for all $\varepsilon > 0$, memoryless $\varepsilon$-optimal strategies
exist. A memoryless strategy is independent of the history of plays, and an $\varepsilon$-optimal strategy
achieves the objective with probability within $\varepsilon$ of the value of the game. In contrast to previous
proofs of this fact, our proof is more elementary and more combinatorial. Second, we present a
strategy-improvement (a.k.a. policy-iteration) algorithm for concurrent games with reachability
objectives. We then present a strategy-improvement algorithm for concurrent games with safety
objectives. Our algorithms yield sequences of player-1 strategies which ensure probabilities of winning
that converge monotonically to the value of the game. Our result is significant because the
strategy-improvement algorithm for safety games provides, for the first time, a way to approximate the
value of a concurrent safety game from below. Previous methods could approximate the values of these
games only from one direction, and as no rates of convergence are known, they did not provide a
practical way to solve these games.

Keywords. Concurrent games; Reachability and safety objectives; Strategy improvement algorithms. 

1 Introduction 



We consider games played between two players on graphs. At every round of the game, each of the two 
players selects a move; the moves of the players then determine the transition to the successor state. A play of 
the game gives rise to a path in the graph. We consider the two basic objectives for the players: reachability
and safety. The reachability goal asks player 1 to reach a given set of target states or, if randomization
is needed to play the game, to maximize the probability of reaching the target set. The safety goal asks
player 2 to ensure that a given set of safe states is never left or, if randomization is required, to minimize
the probability of leaving the safe set. The two objectives are dual, and the games are determined: the
supremum probability with which player 1 can reach the target set is equal to one minus the supremum 
probability with which player 2 can confine the game to the complement of the target set [14]. 

These games on graphs can be divided into two classes: turn-based and concurrent. In turn-based 
games, only one player has a choice of moves at each state; in concurrent games, at each state both players 
choose a move, simultaneously and independently, from a set of available moves. For turn-based games, the 
solution of games with reachability and safety objectives has long been known. If each move determines 
a unique successor state, then the games are P-complete and can be solved in linear time in the size of 
the game graph. If, more generally, each move determines a probability distribution on possible successor 
states, then the problem of deciding whether a turn-based game can be won with probability greater than 
a given threshold $p \in [0, 1]$ is in NP $\cap$ co-NP [5], and the exact value of the game can be computed by a
strategy-improvement algorithm [6], which works well in practice. These results all depend on the fact that 
in turn-based reachability and safety games, both players have optimal deterministic (i.e., no randomization 
is required), memoryless strategies. These strategies are functions from states to moves, so they are finite in 
number, and this guarantees the termination of the strategy-improvement algorithm. 

The situation is very different for concurrent games. The player-1 value of the game is defined, as usual,
as the sup-inf value: the supremum, over all strategies of player 1, of the infimum, over all strategies of
player 2, of the probability of achieving the reachability or safety goal. In concurrent reachability games,
player 1 is guaranteed only the existence of $\varepsilon$-optimal strategies, which ensure that the value of the game
is achieved within a specified tolerance $\varepsilon > 0$ [14]. Moreover, while these strategies (which depend on $\varepsilon$)
are memoryless, in general they require randomization [14] (even in the special case in which the transition 
function is deterministic). For player 2 (the safety player), optimal memoryless strategies exist [24], which 
again require randomization (even when the transition function is deterministic). All of these strategies are 
functions from states to probability distributions on moves. The question of deciding whether a concurrent 
game can be won with probability greater than p is in PSPACE; this is shown by reduction to the theory of 
the real-closed fields [13]. 

To summarize: while strategy-improvement algorithms are available for turn-based reachability and 
safety games [6], so far no strategy-improvement algorithms or even approximation schemes were known 
for concurrent games. If one wanted to compute the value of a concurrent game within a specified tolerance
$\varepsilon > 0$, one was reduced to using a binary search algorithm that approximates the value by iterating queries
in the theory of real-closed fields. Value-iteration schemes were known for such games, but they can be
used to approximate the value from one direction only: for reachability goals from below, and for safety goals
from above [11]. The value-iteration schemes are not guaranteed to terminate. Worse, since no convergence
rates are known for these schemes, they provide no termination criterion for approximating a value within $\varepsilon$.

Our results for concurrent reachability games. Concurrent reachability games belong to the family of 
stochastic games [26, 14], and they have been studied more specifically in [10, 9, 11]. Our contributions for 
concurrent reachability games are two-fold. First, we present a simple and combinatorial proof of the
existence of memoryless $\varepsilon$-optimal strategies for concurrent games with reachability objectives, for all
$\varepsilon > 0$. Second, using the proof techniques we developed for proving the existence of memoryless
$\varepsilon$-optimal strategies, we obtain a strategy-improvement (a.k.a. policy-iteration) algorithm for concurrent
reachability games. Unlike in the special case of turn-based games, the algorithm need not terminate in
finitely many iterations.

It has long been known that optimal strategies need not exist for concurrent reachability games, and that for
all $\varepsilon > 0$, there exist $\varepsilon$-optimal strategies that are memoryless [14]. A proof of this fact can be obtained
by considering the limit of discounted games. The proof considers discounted versions of reachability games,
where a play that reaches the target in $k$ steps is assigned a value of $\alpha^k$, for some discount factor
$0 < \alpha < 1$. It is possible to show that, for $0 < \alpha < 1$, memoryless optimal strategies exist. The result for the
undiscounted ($\alpha = 1$) case follows from an analysis of the limit behavior of such optimal strategies for
$\alpha \to 1$. The limit behavior is studied with the help of results from the field of real Puiseux series [23]. This
proof idea works not only for reachability games, but also for total-reward games with nonnegative rewards
(see [15, Chapter 5] for details). A more recent result [13] establishes the existence of memoryless
$\varepsilon$-optimal strategies for certain infinite-state (recursive) concurrent games, but again the proof relies on results
from analysis and properties of solutions of certain polynomial functions. Another proof of the existence of
memoryless $\varepsilon$-optimal strategies for reachability objectives follows from the result of [14]; that proof uses
induction on the number of states of the game. We show the existence of memoryless $\varepsilon$-optimal strategies
for concurrent reachability games by more combinatorial and elementary means. Our proof relies only on
combinatorial techniques and on simple properties of Markov decision processes [1, 8]. As our proof is
more combinatorial, we believe that the proof techniques will find future applications in game theory.

Our proof of the existence of memoryless $\varepsilon$-optimal strategies, for all $\varepsilon > 0$, is built upon a
value-iteration scheme that converges to the value of the game [11]. The value-iteration scheme computes a
sequence $u_0, u_1, u_2, \ldots$ of valuations, where for $i = 0, 1, 2, \ldots$ each valuation $u_i$ associates with each state
$s$ of the game a lower bound $u_i(s)$ on the value of the game, such that $\lim_{i \to \infty} u_i(s)$ is the value of
the game at $s$. The convergence is monotonic from below, but no rate of convergence was known. From each
valuation $u_i$, we can extract a memoryless, randomized player-1 strategy, by considering the (randomized)
choice of moves for player 1 that achieves the maximal one-step expectation of $u_i$. In general, a strategy $\pi_i$
obtained in this fashion is not guaranteed to achieve the value $u_i$. We show that $\pi_i$ is guaranteed to achieve
the value $u_i$ if it is proper, that is, if regardless of the strategy adopted by player 2, the play reaches with
probability 1 states that are either in the target, or that have no path leading to the target. Next, we show how
to extract from the sequence of valuations $u_0, u_1, u_2, \ldots$ a sequence of memoryless randomized player-1
strategies $\pi_0, \pi_1, \pi_2, \ldots$ that are guaranteed to be proper, and thus achieve the values $u_0, u_1, u_2, \ldots$. This
proves the existence of memoryless $\varepsilon$-optimal strategies for all $\varepsilon > 0$. Our proof is completely different
from the proof of [14]: the proof of [14] uses induction on the number of states, whereas our proof
is based on the notion of ranking function obtained from the value-iteration algorithm.

We then apply the techniques developed for the above proof to design a strategy-improvement
algorithm for concurrent reachability games. Strategy-improvement algorithms, also known as policy-iteration
algorithms in the context of Markov decision processes [20], compute a sequence of memoryless strategies
$\pi'_0, \pi'_1, \pi'_2, \ldots$ such that, for all $k \geq 0$, (i) the strategy $\pi'_{k+1}$ is at all states no worse than $\pi'_k$;
(ii) if $\pi'_{k+1} = \pi'_k$, then $\pi'_k$ is optimal; and (iii) for every $\varepsilon > 0$, we can find a $k$ sufficiently large
so that $\pi'_k$ is $\varepsilon$-optimal. Computing a sequence of strategies $\pi_0, \pi_1, \pi_2, \ldots$ on the basis of the
value-iteration scheme from above does not yield a strategy-improvement algorithm, as condition (ii) may be
violated: there is no guarantee that a step in the value iteration leads to an improvement in the strategy. We
will show that the key to obtaining a strategy-improvement algorithm consists in recomputing, at each
iteration, the values of the player-1 strategy to be improved, and in adopting a particular strategy-update
rule, which ensures that all generated strategies are proper. Unlike previous proofs of strategy-improvement
algorithms for concurrent games [6, 15], which rely on the analysis of discounted versions of the games, our
analysis is again more combinatorial. Hoffman and Karp [19] presented a strategy-improvement algorithm for
the special case of concurrent games with the ergodic property (i.e., from every state $s$, reaching any other
state $t$ can be guaranteed with probability 1); see also the algorithm for discounted games in [25]. Observe
that for concurrent reachability games, with the ergodic assumption the value at all states is trivially 1, and
thus the ergodic assumption gives us the trivial case. Our results give a combinatorial strategy-improvement
algorithm for the whole class of concurrent reachability games. The work of [13] presents a
strategy-improvement algorithm for recursive concurrent games with termination criteria: the algorithm of [13]
is more involved (it depends on properties of certain polynomial functions) and works for the more general
class of recursive concurrent games. Unlike for turn-based games [6], for concurrent games we cannot
guarantee the termination of the strategy-improvement algorithm. However, for turn-based stochastic games
we present a detailed analysis of termination criteria. Our analysis is based on bounds on the precision of
values for turn-based stochastic games. As a consequence of our analysis, we obtain an improved upper bound
for termination for turn-based stochastic games.

Our results for concurrent safety games. We present for the first time a strategy-improvement scheme that 
approximates the value of a concurrent safety game from below. Together with the strategy improvement 
algorithm for reachability games, or the value-iteration scheme, to approximate the value of such a game 
from above, we obtain a termination criterion for computing the value of concurrent reachability and safety 
games within any given tolerance $\varepsilon > 0$. This is the first termination criterion for an algorithm that
approximates the value of a concurrent game. Several difficulties had to be overcome in developing our scheme.
First, while the strategy-improvement algorithm that approximates reachability values from below is based 
on locally improving a strategy on the basis of the valuation it yields, this approach does not suffice for 
approximating safety values from below: we would obtain an increasing sequence of values, but they would 
not necessarily converge to the value of the game (see Example 2). Rather, we introduce a novel, non-local 
improvement step, which augments the standard valuation-based improvement step. Each non-local step 
involves the solution of an appropriately constructed turn-based game. The turn-based game constructed is 
polynomial in the state space of the original game, but exponential in the number of actions. It is an interest- 
ing open question whether the turn-based game can be also made polynomial in the number of the actions. 
Second, as value-iteration for safety objectives converges from above, while our sequences of strategies 
yield values that converge from below, the proof of convergence for our algorithm cannot be derived from 
a connection with value-iteration, as was the case for reachability objectives. We had to develop new proof 
techniques both to show the monotonicity of the strategy values produced by our algorithm, and to show 
their convergence to the value of the game. 

Added value of our algorithms. The new strategy-improvement algorithms we present in this paper make
two important contributions compared to the classical value-iteration algorithms.

1. Termination for approximation. The value-iteration algorithm for reachability games converges from
below, and the value-iteration algorithm for safety games converges from above. Hence, given a desired
precision $\varepsilon > 0$ for approximation, there is no termination criterion to stop the value-iteration
algorithm and guarantee $\varepsilon$-approximation. The sequence of valuations of our strategy-improvement
algorithm for concurrent safety games converges from below, and along with the value-iteration or
strategy-improvement algorithm for concurrent reachability games we obtain the first termination
criterion for $\varepsilon$-approximation of values in concurrent reachability and safety games. Using a result
of [18] on the bound on $k$-uniform memoryless $\varepsilon$-optimal strategies, for $\varepsilon > 0$, we also obtain a
bound on the number of iterations of the strategy-improvement algorithms that guarantee
$\varepsilon$-approximation of the values. Moreover, a recent result of [17] provides a nearly tight
double-exponential upper and lower bound on the number of iterations required for
$\varepsilon$-approximation of the values.

2. Approximation of strategies. Our strategy-improvement algorithms are also the first approach to
approximate memoryless $\varepsilon$-optimal strategies in concurrent reachability and safety games. The witness
strategy produced by the value-iteration algorithm for concurrent reachability games is not
memoryless; and for concurrent safety games, since the value-iteration algorithm converges from above,
it does not provide any witness strategies. Our strategy-improvement algorithms for concurrent
reachability and safety games yield sequences of memoryless strategies that ensure convergence to the
value of the game from below, and yield witness memoryless strategies to approximate the value of
concurrent reachability and safety games.
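
The termination criterion of item 1 can be illustrated by a minimal Python sketch. It assumes two hypothetical iterators, `lower_bounds` and `upper_bounds`, that yield valuations converging to the value from below and from above; these names and the toy sequences are illustrative assumptions, not the algorithms developed in this paper.

```python
from fractions import Fraction

def approximate_value(lower_bounds, upper_bounds, eps):
    """Advance both sequences in lockstep until every state has gap <= eps.

    lower_bounds / upper_bounds: iterators over dicts state -> number,
    assumed (hypothetically) to converge to the value from below / above.
    Returns (lower, upper, number_of_iterations).
    """
    for k, (lo, hi) in enumerate(zip(lower_bounds, upper_bounds), start=1):
        if all(hi[s] - lo[s] <= eps for s in lo):
            return lo, hi, k
    raise RuntimeError("sequences ended before the gap dropped below eps")

def toy_lower():
    # Toy stand-in for a sequence converging to 1/2 from below.
    k = 1
    while True:
        yield {"s": Fraction(1, 2) - Fraction(1, 2 ** k)}
        k += 1

def toy_upper():
    # Toy stand-in for a sequence converging to 1/2 from above.
    k = 1
    while True:
        yield {"s": Fraction(1, 2) + Fraction(1, 2 ** k)}
        k += 1

lo, hi, steps = approximate_value(toy_lower(), toy_upper(), Fraction(1, 100))
```

Without the upper sequence, the loop would have no way to certify how close `lo` is to the value; this is the gap that one-directional value iteration leaves open.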

2 Definitions 

Notation. For a countable set $A$, a probability distribution on $A$ is a function $\delta : A \to [0, 1]$ such that
$\sum_{a \in A} \delta(a) = 1$. We denote the set of probability distributions on $A$ by $\mathcal{D}(A)$. Given a distribution
$\delta \in \mathcal{D}(A)$, we denote by $\mathrm{Supp}(\delta) = \{x \in A \mid \delta(x) > 0\}$ the support set of $\delta$.

Definition 1 (CONCURRENT GAMES). A (two-player) concurrent game structure $G = (S, M, \Gamma_1, \Gamma_2, \delta)$
consists of the following components:

• A finite state space $S$ and a finite set $M$ of moves or actions.

• Two move assignments $\Gamma_1, \Gamma_2 : S \to 2^M \setminus \emptyset$. For $i \in \{1, 2\}$, assignment $\Gamma_i$ associates with each state
$s \in S$ a nonempty set $\Gamma_i(s) \subseteq M$ of moves available to player $i$ at state $s$.

• A probabilistic transition function $\delta : S \times M \times M \to \mathcal{D}(S)$ that gives the probability $\delta(s, a_1, a_2)(t)$
of a transition from $s$ to $t$ when player 1 chooses at state $s$ move $a_1$ and player 2 chooses move $a_2$, for
all $s, t \in S$ and $a_1 \in \Gamma_1(s)$, $a_2 \in \Gamma_2(s)$.

We denote by $|\delta|$ the size of the transition function, i.e.,
$|\delta| = \sum_{s \in S} \sum_{a \in \Gamma_1(s)} \sum_{b \in \Gamma_2(s)} \sum_{t \in S} |\delta(s, a, b)(t)|$, where
$|\delta(s, a, b)(t)|$ is the number of bits required to specify the transition probability $\delta(s, a, b)(t)$. We denote
by $|G|$ the size of the game graph, and $|G| = |\delta| + |S|$. At every state $s \in S$, player 1 chooses a move
$a_1 \in \Gamma_1(s)$, and simultaneously and independently player 2 chooses a move $a_2 \in \Gamma_2(s)$. The game then
proceeds to the successor state $t$ with probability $\delta(s, a_1, a_2)(t)$, for all $t \in S$. A state $s$ is an absorbing
state if for all $a_1 \in \Gamma_1(s)$ and $a_2 \in \Gamma_2(s)$, we have $\delta(s, a_1, a_2)(s) = 1$. In other words, at an absorbing
state $s$ for all choices of moves of the two players, the successor state is always $s$.
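
As an illustration of Definition 1, the following Python sketch shows one possible encoding of a concurrent game structure. The class name `ConcurrentGame`, its field names, and the three-state toy game are assumptions made for this example only; they do not come from the paper.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass
class ConcurrentGame:
    states: frozenset
    moves1: dict   # Gamma_1: state -> nonempty set of player-1 moves
    moves2: dict   # Gamma_2: state -> nonempty set of player-2 moves
    delta: dict    # (s, a1, a2) -> {t: probability of moving from s to t}

    def validate(self):
        """Check nonempty move sets and that each delta(s, a1, a2) is a distribution on S."""
        for s in self.states:
            assert self.moves1[s] and self.moves2[s], f"empty move set at {s}"
            for a1 in self.moves1[s]:
                for a2 in self.moves2[s]:
                    dist = self.delta[(s, a1, a2)]
                    assert set(dist) <= self.states
                    assert all(p >= 0 for p in dist.values())
                    assert sum(dist.values()) == 1

    def is_absorbing(self, s):
        """A state is absorbing if every pair of moves leads back to it with probability 1."""
        return all(
            self.delta[(s, a1, a2)].get(s, 0) == 1
            for a1 in self.moves1[s]
            for a2 in self.moves2[s]
        )

# Hypothetical toy game: at "s" the outcome depends on both simultaneous choices.
one = Fraction(1)
game = ConcurrentGame(
    states=frozenset({"s", "t", "w"}),
    moves1={"s": {"a", "b"}, "t": {"a"}, "w": {"a"}},
    moves2={"s": {"c", "d"}, "t": {"c"}, "w": {"c"}},
    delta={
        ("s", "a", "c"): {"t": one},
        ("s", "a", "d"): {"w": one},
        ("s", "b", "c"): {"w": one},
        ("s", "b", "d"): {"t": one},
        ("t", "a", "c"): {"t": one},
        ("w", "a", "c"): {"w": one},
    },
)
game.validate()
```

Exact rationals (`Fraction`) are used so that the check that probabilities sum to 1 is not affected by floating-point rounding.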

Definition 2 (TURN-BASED STOCHASTIC GAMES). A turn-based stochastic game graph ($2\frac{1}{2}$-player
game graph) $G = ((S, E), (S_1, S_2, S_R), \delta)$ consists of a finite directed graph $(S, E)$, a partition $(S_1, S_2, S_R)$
of the finite set $S$ of states, and a probabilistic transition function $\delta : S_R \to \mathcal{D}(S)$, where $\mathcal{D}(S)$ denotes
the set of probability distributions over the state space $S$. The states in $S_1$ are the player-1 states, where
player 1 decides the successor state; the states in $S_2$ are the player-2 states, where player 2 decides the
successor state; and the states in $S_R$ are the random or probabilistic states, where the successor state is
chosen according to the probabilistic transition function $\delta$. We assume that for $s \in S_R$ and $t \in S$, we have
$(s, t) \in E$ iff $\delta(s)(t) > 0$, and we often write $\delta(s, t)$ for $\delta(s)(t)$. For technical convenience we assume that
every state in the graph $(S, E)$ has at least one outgoing edge. For a state $s \in S$, we write $E(s)$ to denote
the set $\{t \in S \mid (s, t) \in E\}$ of possible successors. We denote by $|\delta|$ the size of the transition function, i.e.,
$|\delta| = \sum_{s \in S_R} \sum_{t \in S} |\delta(s)(t)|$, where $|\delta(s)(t)|$ is the number of bits required to specify the transition
probability $\delta(s)(t)$. We denote by $|G|$ the size of the game graph, and $|G| = |\delta| + |S| + |E|$.

Plays. A play $\omega$ of $G$ is an infinite sequence $\omega = (s_0, s_1, s_2, \ldots)$ of states in $S$ such that for all $k \geq 0$, there
are moves $a_1^k \in \Gamma_1(s_k)$ and $a_2^k \in \Gamma_2(s_k)$ with $\delta(s_k, a_1^k, a_2^k)(s_{k+1}) > 0$. We denote by $\Omega$ the set of all plays,
and by $\Omega_s$ the set of all plays $\omega = (s_0, s_1, s_2, \ldots)$ such that $s_0 = s$, that is, the set of plays starting from
state $s$.

Selectors and strategies. A selector $\xi$ for player $i \in \{1, 2\}$ is a function $\xi : S \to \mathcal{D}(M)$ such that for all
states $s \in S$ and moves $a \in M$, if $\xi(s)(a) > 0$, then $a \in \Gamma_i(s)$. A selector $\xi$ for player $i$ at a state $s$ is a
distribution over moves such that if $\xi(s)(a) > 0$, then $a \in \Gamma_i(s)$. We denote by $\Lambda_i$ the set of all selectors
for player $i \in \{1, 2\}$, and similarly, we denote by $\Lambda_i(s)$ the set of all selectors for player $i$ at a state $s$. The
selector $\xi$ is pure if for every state $s \in S$, there is a move $a \in M$ such that $\xi(s)(a) = 1$. A strategy for
player $i \in \{1, 2\}$ is a function $\pi : S^+ \to \mathcal{D}(M)$ that associates with every finite, nonempty sequence of
states, representing the history of the play so far, a selector for player $i$; that is, for all $w \in S^*$ and $s \in S$,
we have $\mathrm{Supp}(\pi(w \cdot s)) \subseteq \Gamma_i(s)$. The strategy $\pi$ is pure if it always chooses a pure selector; that is, for all
$w \in S^+$, there is a move $a \in M$ such that $\pi(w)(a) = 1$. A memoryless strategy is independent of the history
of the play and depends only on the current state. Memoryless strategies correspond to selectors; we write
$\overline{\xi}$ for the memoryless strategy consisting in playing forever the selector $\xi$. A strategy is pure memoryless
if it is both pure and memoryless. In a turn-based stochastic game, a strategy for player 1 is a function
$\pi_1 : S^* \cdot S_1 \to \mathcal{D}(S)$, such that for all $w \in S^*$ and for all $s \in S_1$ we have $\mathrm{Supp}(\pi_1(w \cdot s)) \subseteq E(s)$.
Memoryless strategies and pure memoryless strategies are obtained as the restriction of strategies as in the
case of concurrent game graphs. The family of strategies for player 2 is defined analogously. We denote
by $\Pi_1$ and $\Pi_2$ the sets of all strategies for player 1 and player 2, respectively. We denote by $\Pi_i^M$ and $\Pi_i^{PM}$
the sets of memoryless strategies and pure memoryless strategies for player $i$, respectively.

Destinations of moves and selectors. For all states $s \in S$ and moves $a_1 \in \Gamma_1(s)$ and $a_2 \in \Gamma_2(s)$, we
indicate by $\mathrm{Dest}(s, a_1, a_2) = \mathrm{Supp}(\delta(s, a_1, a_2))$ the set of possible successors of $s$ when the moves $a_1$ and
$a_2$ are chosen. Given a state $s$, and selectors $\xi_1$ and $\xi_2$ for the two players, we denote by

$$\mathrm{Dest}(s, \xi_1, \xi_2) = \bigcup_{a_1 \in \mathrm{Supp}(\xi_1(s)),\; a_2 \in \mathrm{Supp}(\xi_2(s))} \mathrm{Dest}(s, a_1, a_2)$$

the set of possible successors of $s$ with respect to the selectors $\xi_1$ and $\xi_2$.
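
The two destination sets can be computed directly from their definitions. The sketch below is illustrative only: the function names `supp`, `dest_moves`, and `dest_selectors`, the dictionary encoding of distributions, and the toy transition table are assumptions of this example.

```python
def supp(dist):
    """Support of a distribution given as a dict element -> probability."""
    return {x for x, p in dist.items() if p > 0}

def dest_moves(delta, s, a1, a2):
    """Dest(s, a1, a2) = Supp(delta(s, a1, a2))."""
    return supp(delta[(s, a1, a2)])

def dest_selectors(delta, s, xi1, xi2):
    """Dest(s, xi1, xi2): union over the moves in the supports of xi1(s) and xi2(s)."""
    result = set()
    for a1 in supp(xi1[s]):
        for a2 in supp(xi2[s]):
            result |= dest_moves(delta, s, a1, a2)
    return result

# Hypothetical transitions at a single state "s".
delta = {
    ("s", "a", "c"): {"t": 1.0},
    ("s", "a", "d"): {"w": 1.0},
    ("s", "b", "c"): {"w": 0.5, "s": 0.5},
    ("s", "b", "d"): {"t": 1.0},
}
xi1 = {"s": {"a": 1.0, "b": 0.0}}   # pure selector: always play a
xi2 = {"s": {"c": 0.5, "d": 0.5}}   # uniform over c and d
successors = dest_selectors(delta, "s", xi1, xi2)
```

Note that only the supports of the selectors matter here, not the exact probabilities.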

Once a starting state $s$ and strategies $\pi_1$ and $\pi_2$ for the two players are fixed, the game is reduced to an
ordinary stochastic process. Hence, the probabilities of events are uniquely defined, where an event $\mathcal{A} \subseteq \Omega_s$
is a measurable set of plays. For an event $\mathcal{A} \subseteq \Omega_s$, we denote by $\Pr_s^{\pi_1,\pi_2}(\mathcal{A})$ the probability that a play
belongs to $\mathcal{A}$ when the game starts from $s$ and the players follow the strategies $\pi_1$ and $\pi_2$. Similarly, for
a measurable function $f : \Omega_s \to \mathbb{R}$, we denote by $\mathrm{E}_s^{\pi_1,\pi_2}(f)$ the expected value of $f$ when the game starts
from $s$ and the players follow the strategies $\pi_1$ and $\pi_2$. For $i \geq 0$, we denote by $\Theta_i : \Omega \to S$ the random
variable denoting the $i$-th state along a play.

Valuations. A valuation is a mapping $v : S \to [0, 1]$ associating a real number $v(s) \in [0, 1]$ with each state
$s$. Given two valuations $v, w : S \to \mathbb{R}$, we write $v \leq w$ when $v(s) \leq w(s)$ for all states $s \in S$. For an event
$\mathcal{A}$, we denote by $\Pr^{\pi_1,\pi_2}(\mathcal{A})$ the valuation $S \to [0, 1]$ defined for all states $s \in S$ by
$(\Pr^{\pi_1,\pi_2}(\mathcal{A}))(s) = \Pr_s^{\pi_1,\pi_2}(\mathcal{A})$. Similarly, for a measurable function $f : \Omega_s \to [0, 1]$, we denote by
$\mathrm{E}^{\pi_1,\pi_2}(f)$ the valuation $S \to [0, 1]$ defined for all $s \in S$ by $(\mathrm{E}^{\pi_1,\pi_2}(f))(s) = \mathrm{E}_s^{\pi_1,\pi_2}(f)$.

The Pre operator. Given a valuation $v$, and two selectors $\xi_1 \in \Lambda_1$ and $\xi_2 \in \Lambda_2$, we define the valuations
$\mathrm{Pre}_{\xi_1,\xi_2}(v)$, $\mathrm{Pre}_{1:\xi_1}(v)$, and $\mathrm{Pre}_1(v)$ as follows, for all states $s \in S$:

$$\mathrm{Pre}_{\xi_1,\xi_2}(v)(s) = \sum_{a,b \in M} \sum_{t \in S} v(t) \cdot \delta(s, a, b)(t) \cdot \xi_1(s)(a) \cdot \xi_2(s)(b)$$

$$\mathrm{Pre}_{1:\xi_1}(v)(s) = \inf_{\xi_2 \in \Lambda_2} \mathrm{Pre}_{\xi_1,\xi_2}(v)(s)$$

$$\mathrm{Pre}_1(v)(s) = \sup_{\xi_1 \in \Lambda_1} \inf_{\xi_2 \in \Lambda_2} \mathrm{Pre}_{\xi_1,\xi_2}(v)(s)$$

Intuitively, $\mathrm{Pre}_1(v)(s)$ is the greatest expectation of $v$ that player 1 can guarantee at a successor state of $s$.
Also note that given a valuation $v$, the computation of $\mathrm{Pre}_1(v)$ reduces to the solution of a zero-sum one-shot
matrix game, and can be solved by linear programming. Similarly, $\mathrm{Pre}_{1:\xi_1}(v)(s)$ is the greatest expectation
of $v$ that player 1 can guarantee at a successor state of $s$ by playing the selector $\xi_1$. Note that all of these
operators on valuations are monotonic: for two valuations $v, w$, if $v \leq w$, then for all selectors $\xi_1 \in \Lambda_1$ and
$\xi_2 \in \Lambda_2$, we have $\mathrm{Pre}_{\xi_1,\xi_2}(v) \leq \mathrm{Pre}_{\xi_1,\xi_2}(w)$, $\mathrm{Pre}_{1:\xi_1}(v) \leq \mathrm{Pre}_{1:\xi_1}(w)$, and $\mathrm{Pre}_1(v) \leq \mathrm{Pre}_1(w)$.
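
The first two operators can be evaluated by direct summation, as the following Python sketch illustrates. The function names `pre_selectors` and `pre_1_fixed`, the dictionary encoding, and the toy game are assumptions of this example. The sketch does not compute $\mathrm{Pre}_1$, which requires solving a matrix game.

```python
from fractions import Fraction

def pre_selectors(v, delta, s, xi1_s, xi2_s):
    """Pre_{xi1,xi2}(v)(s): expected value of v after one step from s."""
    return sum(
        v[t] * p * pa * pb
        for a, pa in xi1_s.items()
        for b, pb in xi2_s.items()
        for t, p in delta[(s, a, b)].items()
    )

def pre_1_fixed(v, delta, s, xi1_s, moves2_s):
    """Pre_{1:xi1}(v)(s): the infimum over player-2 selectors.

    The expectation is linear in xi2(s), so the infimum is attained at a
    pure move of player 2; we therefore minimize over moves2_s only.
    """
    return min(pre_selectors(v, delta, s, xi1_s, {b: 1}) for b in moves2_s)

# Hypothetical one-state example: player 1 wants to reach "t" (value 1).
half = Fraction(1, 2)
delta = {
    ("s", "a", "c"): {"t": 1}, ("s", "a", "d"): {"w": 1},
    ("s", "b", "c"): {"w": 1}, ("s", "b", "d"): {"t": 1},
}
v = {"s": 0, "t": 1, "w": 0}
uniform = {"a": half, "b": half}
pure_a = {"a": 1, "b": 0}

guaranteed_uniform = pre_1_fixed(v, delta, "s", uniform, {"c", "d"})
guaranteed_pure = pre_1_fixed(v, delta, "s", pure_a, {"c", "d"})
```

In this toy game the uniform selector guarantees more than the pure one, which illustrates why selectors in concurrent games are distributions over moves.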

Reachability and safety objectives. Given a set $F \subseteq S$ of safe states, the objective of a safety game consists
in never leaving $F$. Therefore, we define the set of winning plays as the set
$\mathrm{Safe}(F) = \{(s_0, s_1, s_2, \ldots) \in \Omega \mid s_k \in F \text{ for all } k \geq 0\}$. Given a subset $T \subseteq S$ of target states, the
objective of a reachability game consists in reaching $T$. Correspondingly, the set of winning plays is the set
$\mathrm{Reach}(T) = \{(s_0, s_1, s_2, \ldots) \in \Omega \mid s_k \in T \text{ for some } k \geq 0\}$ of plays that visit $T$. For all $F \subseteq S$ and
$T \subseteq S$, the sets $\mathrm{Safe}(F)$ and $\mathrm{Reach}(T)$ are measurable. An objective in general is a measurable set, and in
this paper we consider only reachability and safety objectives. For an objective $\Phi$, the probability of
satisfying $\Phi$ from a state $s \in S$ under strategies $\pi_1$ and $\pi_2$ for players 1 and 2, respectively, is
$\Pr_s^{\pi_1,\pi_2}(\Phi)$. We define the value for player 1 of the game with objective $\Phi$ from the state $s \in S$ as

$$\langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\Phi)(s) = \sup_{\pi_1 \in \Pi_1} \inf_{\pi_2 \in \Pi_2} \Pr\nolimits_s^{\pi_1,\pi_2}(\Phi);$$

i.e., the value is the maximal probability with which player 1 can guarantee the satisfaction of $\Phi$ against all
player-2 strategies. Given a player-1 strategy $\pi_1$, we use the notation

$$\langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}^{\pi_1}(\Phi)(s) = \inf_{\pi_2 \in \Pi_2} \Pr\nolimits_s^{\pi_1,\pi_2}(\Phi).$$

A strategy $\pi_1$ for player 1 is optimal for an objective $\Phi$ if for all states $s \in S$, we have

$$\langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}^{\pi_1}(\Phi)(s) = \langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\Phi)(s).$$

For $\varepsilon > 0$, a strategy $\pi_1$ for player 1 is $\varepsilon$-optimal if for all states $s \in S$, we have

$$\langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}^{\pi_1}(\Phi)(s) \geq \langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\Phi)(s) - \varepsilon.$$

The notions of values and optimal strategies for player 2 are defined analogously. Reachability and safety
objectives are dual, i.e., we have $\mathrm{Reach}(T) = \Omega \setminus \mathrm{Safe}(S \setminus T)$. The quantitative determinacy result of [14]
ensures that for all states $s \in S$, we have

$$\langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\mathrm{Safe}(F))(s) + \langle\!\langle 2 \rangle\!\rangle_{\mathit{val}}(\mathrm{Reach}(S \setminus F))(s) = 1.$$

3 Markov Decision Processes



To develop our arguments, we need some facts about one-player versions of concurrent stochastic games,
known as Markov decision processes (MDPs) [12, 1]. For $i \in \{1, 2\}$, a player-$i$ MDP (for short, $i$-MDP) is
a concurrent game where, for all states $s \in S$, we have $|\Gamma_{3-i}(s)| = 1$. Given a concurrent game $G$, if we
fix a memoryless strategy corresponding to selector $\xi_1$ for player 1, the game is equivalent to a 2-MDP $G_{\xi_1}$
with the transition function

$$\delta_{\xi_1}(s, a_2)(t) = \sum_{a_1 \in \Gamma_1(s)} \delta(s, a_1, a_2)(t) \cdot \xi_1(s)(a_1),$$

for all $s \in S$ and $a_2 \in \Gamma_2(s)$. Similarly, if we fix selectors $\xi_1$ and $\xi_2$ for both players in a concurrent game
$G$, we obtain a Markov chain, which we denote by $G_{\xi_1,\xi_2}$.

End components. In an MDP, the sets of states that play an equivalent role to the closed recurrent classes 
of Markov chains [21, Chapter 4] are called "end components" [7, 8]. 

Definition 3 (END COMPONENTS). An end component of an $i$-MDP $G$, for $i \in \{1, 2\}$, is a subset $C \subseteq S$
of the states such that there is a selector $\xi$ for player $i$ so that $C$ is a closed recurrent class of the Markov
chain $G_\xi$.

It is not difficult to see that an equivalent characterization of an end component $C$ is the following. For each
state $s \in C$, there is a subset $M_i(s) \subseteq \Gamma_i(s)$ of moves such that:

1. (closed) if a move in $M_i(s)$ is chosen by player $i$ at state $s$, then all successor states that are obtained
with nonzero probability lie in $C$; and

2. (recurrent) the graph $(C, E)$, where $E$ consists of the transitions that occur with nonzero probability
when moves in $M_i(\cdot)$ are chosen by player $i$, is strongly connected.
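
The two conditions above can be checked mechanically for a candidate set $C$ and candidate move subsets. The following Python sketch is illustrative only: it assumes a one-player encoding in which `delta` maps a (state, move) pair to a successor distribution, and the function name `is_end_component` and the toy MDP are hypothetical.

```python
def is_end_component(C, M, delta):
    """Check the (closed, recurrent) characterization for a candidate set C.

    C: set of states; M: state -> chosen subset of moves at that state;
    delta: (state, move) -> {successor: probability} for a one-player MDP.
    """
    C = set(C)
    if not C:
        return False
    edges = {s: set() for s in C}
    for s in C:
        if not M.get(s):
            return False                  # need at least one chosen move
        for a in M[s]:
            succ = {t for t, p in delta[(s, a)].items() if p > 0}
            if not succ <= C:
                return False              # (closed) is violated
            edges[s] |= succ

    def reachable(start, adj):
        seen, stack = {start}, [start]
        while stack:
            for t in adj[stack.pop()]:
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    reverse = {s: set() for s in C}
    for s, succs in edges.items():
        for t in succs:
            reverse[t].add(s)
    root = next(iter(C))
    # (recurrent): strongly connected iff root reaches all of C and all of C reaches root
    return reachable(root, edges) == C and reachable(root, reverse) == C

# Hypothetical MDP: p and q form a cycle under move a; move b leaves towards r.
delta = {
    ("p", "a"): {"q": 1.0},
    ("p", "b"): {"r": 1.0},
    ("q", "x"): {"p": 1.0},
    ("r", "x"): {"r": 1.0},
}
```

For instance, `{"p", "q"}` with moves `{"p": {"a"}, "q": {"x"}}` passes both checks, whereas adding move `b` at `p` breaks closedness.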

Given a play $\omega \in \Omega$, we denote by $\mathrm{Inf}(\omega)$ the set of states that occur infinitely often along $\omega$. Given a set
$\mathcal{F} \subseteq 2^S$ of subsets of states, we denote by $\mathrm{Inf}(\mathcal{F})$ the event $\{\omega \mid \mathrm{Inf}(\omega) \in \mathcal{F}\}$. The following theorem
states that in a 2-MDP, for every strategy of player 2, the set of states that are visited infinitely often is, with
probability 1, an end component. Corollary 1 follows easily from Theorem 1.

Theorem 1 ([8]). For a player-1 selector $\xi_1$, let $\mathcal{C}$ be the set of end components of a 2-MDP $G_{\xi_1}$. For all
player-2 strategies $\pi_2$ and all states $s \in S$, we have $\Pr_s^{\overline{\xi}_1,\pi_2}(\mathrm{Inf}(\mathcal{C})) = 1$.

Corollary 1 For a player-1 selector $\xi_1$, let $\mathcal{C}$ be the set of end components of a 2-MDP $G_{\xi_1}$, and let
$Z = \bigcup_{C \in \mathcal{C}} C$ be the set of states of all end components. For all player-2 strategies $\pi_2$ and all states $s \in S$,
we have $\Pr_s^{\overline{\xi}_1,\pi_2}(\mathrm{Reach}(Z)) = 1$.

MDPs with reachability objectives. Given a 2-MDP with a reachability objective $\mathrm{Reach}(T)$ for player 2,
where $T \subseteq S$, the values can be obtained as the solution of a linear program [15] (see Section 2.9 of [15],
where a linear-program solution is given for MDPs with limit-average objectives; the reachability objective is
a special case of limit-average objectives). The linear program has a variable $x(s)$ for all states $s \in S$, and
the objective function and the constraints are as follows:

$$\min \sum_{s \in S} x(s) \quad \text{subject to}$$

$$x(s) \geq \sum_{t \in S} x(t) \cdot \delta(s, a_2)(t) \quad \text{for all } s \in S \text{ and } a_2 \in \Gamma_2(s)$$

$$x(s) = 1 \quad \text{for all } s \in T$$

$$0 \leq x(s) \leq 1 \quad \text{for all } s \in S$$

The correctness of the above linear program to compute the values follows from [15] (see Section 2.9 of [15],
and also see [7] for the correctness of the linear program).
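
To make the constraints concrete, the following Python sketch checks whether a candidate assignment $x$ satisfies them and evaluates the objective. It is not a solver, and the names `lp_feasible` and `objective`, the dictionary encoding, and the three-state toy 2-MDP are assumptions of this example.

```python
from fractions import Fraction

def lp_feasible(x, delta, moves2, target):
    """Check the three constraint families of the linear program for a candidate x."""
    for s, moves in moves2.items():
        if not 0 <= x[s] <= 1:
            return False                   # bounds 0 <= x(s) <= 1
        if s in target and x[s] != 1:
            return False                   # x(s) = 1 on the target
        for a2 in moves:
            if x[s] < sum(x[t] * p for t, p in delta[(s, a2)].items()):
                return False               # x(s) >= one-step expectation
    return True

def objective(x):
    """The quantity minimized by the linear program."""
    return sum(x.values())

# Hypothetical 2-MDP: from m, move l reaches the target g with probability 1/2.
half = Fraction(1, 2)
moves2 = {"g": ["stay"], "z": ["stay"], "m": ["l", "r"]}
delta = {
    ("g", "stay"): {"g": 1},
    ("z", "stay"): {"z": 1},
    ("m", "l"): {"g": half, "z": half},
    ("m", "r"): {"z": 1},
}
target = {"g"}
values = {"g": Fraction(1), "z": Fraction(0), "m": half}      # worked out by hand
all_ones = {"g": Fraction(1), "z": Fraction(1), "m": Fraction(1)}
too_low = {"g": Fraction(1), "z": Fraction(0), "m": Fraction(1, 4)}
```

In this toy instance both `values` and `all_ones` are feasible, but `values` has the smaller objective; minimizing the sum is what selects the least feasible assignment.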

4 Existence of Memoryless $\varepsilon$-Optimal Strategies for Concurrent Reachability Games

In this section we present an elementary and combinatorial proof of the existence of memoryless $\varepsilon$-optimal
strategies for concurrent reachability games, for all $\varepsilon > 0$ (optimal strategies need not exist for concurrent
games with reachability objectives [14]).

4.1 From value iteration to selectors 

Consider a reachability game with target $T \subseteq S$, i.e., the objective for player 1 is $\mathrm{Reach}(T)$. Let
$W_2 = \{s \in S \mid \langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\mathrm{Reach}(T))(s) = 0\}$ be the set of states from which player 1 cannot reach the
target with positive probability. From [9], we know that this set can be computed as
$W_2 = \lim_{k \to \infty} W_2^k$, where $W_2^0 = S \setminus T$, and for all $k \geq 0$,

$$W_2^{k+1} = \{s \in S \setminus T \mid \exists a_2 \in \Gamma_2(s) \,.\, \forall a_1 \in \Gamma_1(s) \,.\, \mathrm{Dest}(s, a_1, a_2) \subseteq W_2^k\}.$$

The limit is reached in at most $|S|$ iterations. Note that player 2 has a strategy that confines the game to $W_2$,
and that consequently all strategies are optimal for player 1, as they realize the value 0 of the game in $W_2$.
Therefore, without loss of generality, in the remainder we assume that all states in $W_2$ and $T$ are absorbing.
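
The iteration defining $W_2$ can be sketched in a few lines of Python. The function name `compute_w2`, the encoding of $\mathrm{Dest}$ as a dictionary of sets, and the four-state toy game are assumptions made for this illustration.

```python
def compute_w2(states, target, moves1, moves2, dest):
    """Iterate W_2^0 = S minus T and the one-step condition until a fixpoint is reached.

    dest: (s, a1, a2) -> set of possible successors, i.e. Dest(s, a1, a2).
    """
    non_target = set(states) - set(target)
    w = set(non_target)                    # W_2^0
    while True:
        nxt = {
            s for s in non_target
            # exists a2 such that for all a1 the successors stay in the current set
            if any(
                all(dest[(s, a1, a2)] <= w for a1 in moves1[s])
                for a2 in moves2[s]
            )
        }
        if nxt == w:
            return w
        w = nxt

# Hypothetical game: at r, player 2 can force the sink w with move c;
# at s, every player-2 move leaves some player-1 move that hits the target t.
states = {"s", "r", "t", "w"}
target = {"t"}
moves1 = {"s": {"a", "b"}, "r": {"a", "b"}, "t": {"a"}, "w": {"a"}}
moves2 = {"s": {"c", "d"}, "r": {"c", "d"}, "t": {"c"}, "w": {"c"}}
dest = {
    ("s", "a", "c"): {"t"}, ("s", "a", "d"): {"w"},
    ("s", "b", "c"): {"w"}, ("s", "b", "d"): {"t"},
    ("r", "a", "c"): {"w"}, ("r", "b", "c"): {"w"},
    ("r", "a", "d"): {"t"}, ("r", "b", "d"): {"s"},
    ("t", "a", "c"): {"t"}, ("w", "a", "c"): {"w"},
}
w2 = compute_w2(states, target, moves1, moves2, dest)
```

The computation uses only the supports of the transition distributions, so it is purely graph-theoretic.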

Our first step towards proving the existence of memoryless $\varepsilon$-optimal strategies for reachability games
consists in considering a value-iteration scheme for the computation of $\langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))$. Let $[T] : S \to [0, 1]$
be the indicator function of $T$, defined by $[T](s) = 1$ for $s \in T$, and $[T](s) = 0$ for $s \notin T$. Let
$u_0 = [T]$, and for all $k \ge 0$, let

$$u_{k+1} = \mathit{Pre}_1(u_k). \tag{1}$$

Note that the classical equation assigns $u_{k+1} = [T] \vee \mathit{Pre}_1(u_k)$, where $\vee$ is interpreted as the maximum
in pointwise fashion. Since we assume that all states in $T$ are absorbing, the classical equation reduces
to the simpler equation given by (1). From the monotonicity of $\mathit{Pre}_1$ it follows that $u_k \le u_{k+1}$,
that is, $\mathit{Pre}_1(u_k) \ge u_k$ for all $k \ge 0$. The result of [11] establishes by a combinatorial argument
that $\langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T)) = \lim_{k \to \infty} u_k$, where the limit is interpreted in pointwise fashion. For
all $k \ge 0$, let the player-1 selector $\zeta_k$ be a value-optimal selector for $u_k$, that is, a selector such that
$\mathit{Pre}_1(u_k) = \mathit{Pre}_{1:\zeta_k}(u_k)$. An $\varepsilon$-optimal strategy $\pi_1^k$ for player 1 can be constructed by applying the
sequence $\zeta_k, \zeta_{k-1}, \ldots, \zeta_1, \zeta_0, \zeta_0, \zeta_0, \ldots$ of selectors, where the last selector, $\zeta_0$, is repeated forever. It is
possible to prove by induction on $k$ that

$$\inf_{\pi_2 \in \Pi_2} \Pr\nolimits^{\pi_1^k, \pi_2}(\exists j \in [0..k] .\ \Theta_j \in T) \ge u_k.$$







Figure 1: An MDP with reachability objective.

As the strategies $\pi_1^k$, for $k \ge 0$, are not necessarily memoryless, this proof does not suffice for showing
the existence of memoryless $\varepsilon$-optimal strategies. On the other hand, the following example shows that the
memoryless strategy $\overline{\zeta}_k$ does not necessarily guarantee the value $u_k$.

Example 1 Consider the 1-MDP shown in Fig. 1. At all states except $s_3$, the set of available moves for
player 1 is a singleton set. At $s_3$, the available moves for player 1 are $a$ and $b$. The transitions at the various
states are shown in the figure. The objective of player 1 is to reach the state $s_0$.

We consider the value-iteration procedure and denote by $u_k$ the valuation after $k$ iterations. Writing a
valuation $u$ as the list of values $(u(s_0), u(s_1), \ldots, u(s_4))$, we have:

$$\begin{aligned}
u_0 &= (1, 0, 0, 0, 0) \\
u_1 &= \mathit{Pre}_1(u_0) = (1, 0, \tfrac12, 0, 0) \\
u_2 &= \mathit{Pre}_1(u_1) = (1, 0, \tfrac12, \tfrac12, 0) \\
u_3 &= \mathit{Pre}_1(u_2) = (1, 0, \tfrac12, \tfrac12, \tfrac12) \\
u_4 &= \mathit{Pre}_1(u_3) = u_3 = (1, 0, \tfrac12, \tfrac12, \tfrac12)
\end{aligned}$$

The valuation $u_3$ is thus a fixpoint.

Now consider the selector $\xi_1$ for player 1 that chooses at state $s_3$ the move $a$ with probability 1. The
selector $\xi_1$ is optimal with respect to the valuation $u_3$. However, if player 1 follows the memoryless strategy
$\overline{\xi}_1$, then the play visits $s_3$ and $s_4$ alternately and reaches $s_0$ with probability 0. Thus, $\xi_1$ is an example of a
selector that is value-optimal, but not optimal.

On the other hand, consider any selector $\xi_1'$ for player 1 that chooses move $b$ at state $s_3$ with positive
probability. Under the memoryless strategy $\overline{\xi}_1'$, the set $\{s_0, s_1\}$ of states is reached with probability 1, and
$s_0$ is reached with probability $\tfrac12$. Such a $\xi_1'$ is thus an example of a selector that is both value-optimal and
optimal. ∎
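
The valuations of Example 1 can be reproduced with a short value-iteration sketch. Since Figure 1 is not available in this text, the transition structure below is an assumption reconstructed from the valuations and the prose of the example: $s_0$ and $s_1$ are absorbing, $s_2$ moves to $s_0$ or $s_1$ with probability $\tfrac12$ each, move $a$ at $s_3$ leads to $s_4$, move $b$ at $s_3$ leads to $s_2$, and $s_4$ leads back to $s_3$. The names `MDP`, `pre1` and `value_iteration` are hypothetical.

```python
from fractions import Fraction as F

# ASSUMED reconstruction of the MDP of Figure 1: MDP[s][move] = {successor: probability}.
MDP = {
    0: {"stay": {0: F(1)}},                  # target s0 (absorbing)
    1: {"stay": {1: F(1)}},                  # losing sink s1 (absorbing)
    2: {"flip": {0: F(1, 2), 1: F(1, 2)}},
    3: {"a": {4: F(1)}, "b": {2: F(1)}},
    4: {"back": {3: F(1)}},
}


def pre1(u):
    """One-step operator for a 1-player MDP: best expected value of u over the moves."""
    return [
        max(sum(p * u[t] for t, p in dist.items()) for dist in MDP[s].values())
        for s in sorted(MDP)
    ]


def value_iteration(steps):
    """Return [u_0, u_1, ..., u_steps] with u_0 = [T] and u_{k+1} = Pre_1(u_k)."""
    u = [F(1), F(0), F(0), F(0), F(0)]
    seq = [u]
    for _ in range(steps):
        u = pre1(u)
        seq.append(u)
    return seq


U = value_iteration(4)
```

Under the assumed transitions, `U[3]` and `U[4]` coincide, matching the fixpoint stated in the example; move $a$ at $s_3$ attains the maximum in `pre1(U[3])` even though playing it forever never reaches $s_0$.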

In the example, the problem is that the strategy $\overline{\xi}_1$ may cause player 1 to stay forever in $S \setminus (T \cup W_2)$
with positive probability. We call "proper" the strategies of player 1 that guarantee reaching $T \cup W_2$ with
probability 1.

Definition 4 (Proper strategies and selectors). A player-1 strategy $\pi_1$ is proper if for all player-2
strategies $\pi_2$, and for all states $s \in S \setminus (T \cup W_2)$, we have $\Pr_s^{\pi_1, \pi_2}(\mathrm{Reach}(T \cup W_2)) = 1$. A player-1
selector $\xi_1$ is proper if the memoryless player-1 strategy $\overline{\xi}_1$ is proper.

We note that proper strategies are closely related to Condon's notion of a halting game [5]: precisely, a game 
is halting iff all player-1 strategies are proper. We can check whether a selector for player 1 is proper by 
considering only the pure selectors for player 2. 
Lemma 1 Given a selector $\xi_1$ for player 1, the memoryless player-1 strategy $\overline{\xi}_1$ is proper iff for every pure
selector $\xi_2$ for player 2, and for all states $s \in S$, we have $\Pr_s^{\overline{\xi}_1, \overline{\xi}_2}(\mathrm{Reach}(T \cup W_2)) = 1$.

Proof. We prove the contrapositive. Given a player-1 selector $\xi_1$, consider the 2-MDP $G_{\xi_1}$. If $\overline{\xi}_1$ is not
proper, then by Theorem 1, there must exist an end component $C \subseteq S \setminus (T \cup W_2)$ in $G_{\xi_1}$. Then, from
$C$, player 2 can avoid reaching $T \cup W_2$ by repeatedly applying a pure selector $\xi_2$ that at every state $s \in C$
deterministically chooses a move $a_2 \in \Gamma_2(s)$ such that $\mathrm{Dest}(s, \xi_1, a_2) \subseteq C$. The existence of a suitable
$\xi_2(s)$ for all states $s \in C$ follows from the definition of end component. ∎
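
The argument of Lemma 1 suggests a simple graph-based check for properness, sketched below. The sketch computes the largest set $C \subseteq S \setminus (T \cup W_2)$ in which player 2 always has a move keeping every possible successor inside $C$; every end component of $G_{\xi_1}$ inside $S \setminus (T \cup W_2)$ is contained in that set, and a nonempty set yields a confining pure selector, so the selector is proper exactly when the set is empty. The function name `is_proper` and the callable `dest_sel` are assumptions for this illustration, and the example transitions are the assumed reconstruction of Figure 1 (the figure itself is not available here).

```python
def is_proper(states, absorbing, moves2, dest_sel):
    """Return True iff player 2 cannot confine the play to non-absorbing states.

    dest_sel(s, a2): successors with positive probability when player 1 plays
    the fixed selector at s and player 2 plays a2 (hypothetical encoding).
    """
    c = set(states) - set(absorbing)
    while True:
        keep = {s for s in c if any(dest_sel(s, a2) <= c for a2 in moves2[s])}
        if keep == c:
            return not c  # proper iff no confining set survives
        c = keep


# ASSUMED structure of the 1-MDP of Example 1; player 2 has a single dummy move.
MOVES2 = {s: ["-"] for s in range(5)}
ABSORBING = {0, 1}                          # T = {s0}, W_2 = {s1}
PURE_A = {2: {0, 1}, 3: {4}, 4: {3}}        # selector playing a at s3
MIXED = {2: {0, 1}, 3: {4, 2}, 4: {3}}      # selector mixing a and b at s3

proper_a = is_proper(range(5), ABSORBING, MOVES2, lambda s, a2: PURE_A[s])
proper_mixed = is_proper(range(5), ABSORBING, MOVES2, lambda s, a2: MIXED[s])
```

With these assumed transitions the pure selector that always plays $a$ is reported as not proper (the play cycles between $s_3$ and $s_4$), while any selector that gives $b$ positive probability is reported as proper.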

The following lemma shows that the selector that chooses all available moves uniformly at random is
proper. This fact will be used later to initialize our strategy-improvement algorithm.

Lemma 2 Let $\xi_1^{unif}$ be the player-1 selector that at all states $s \in S \setminus (T \cup W_2)$ chooses all moves in $\Gamma_1(s)$
uniformly at random. Then $\xi_1^{unif}$ is proper.

Proof. Assume towards contradiction that $\xi_1^{unif}$ is not proper. From Theorem 1, in the 2-MDP $G_{\xi_1^{unif}}$ there
must be an end component $C \subseteq S \setminus (T \cup W_2)$. Then, when player 1 follows the strategy $\overline{\xi}_1^{unif}$, player 2
can confine the game to $C$. By the definition of $\xi_1^{unif}$, player 2 can ensure that the game does not leave $C$
regardless of the moves chosen by player 1, and thus, for all strategies of player 1. This contradicts the fact
that $W_2$ contains all states from which player 2 can ensure that $T$ is not reached. ∎

The following lemma shows that if the player-1 selector $\zeta_k$ computed by the value-iteration scheme (1)
is proper, then the player-1 strategy $\overline{\zeta}_k$ guarantees the value $u_k$, for all $k \ge 0$.

Lemma 3 Let $v$ be a valuation such that $\mathit{Pre}_1(v) \ge v$ and $v(s) = 0$ for all states $s \in W_2$. Let $\xi_1$ be a
selector for player 1 such that $\mathit{Pre}_{1:\xi_1}(v) = \mathit{Pre}_1(v)$. If $\xi_1$ is proper, then for all player-2 strategies $\pi_2$, we
have $\Pr^{\overline{\xi}_1, \pi_2}(\mathrm{Reach}(T)) \ge v$.

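Proof. Consider an arbitrary player-2 strategy $\pi_2$, and for $k \ge 0$, let

$$v_k = \mathrm{E}^{\overline{\xi}_1, \pi_2}(v(\Theta_k))$$

be the expected value of $v$ after $k$ steps under $\overline{\xi}_1$ and $\pi_2$. By induction on $k$, we can prove $v_k \ge v$ for all
$k \ge 0$. In fact, $v_0 = v$, and for $k \ge 0$, we have

$$v_{k+1} \ge \mathit{Pre}_{1:\xi_1}(v_k) \ge \mathit{Pre}_{1:\xi_1}(v) = \mathit{Pre}_1(v) \ge v.$$

For all $k \ge 0$ and $s \in S$, we can write $v_k$ as

$$\begin{aligned}
v_k(s) ={} & \mathrm{E}_s^{\overline{\xi}_1, \pi_2}(v(\Theta_k) \mid \Theta_k \in T) \cdot \Pr\nolimits_s^{\overline{\xi}_1, \pi_2}(\Theta_k \in T) \\
& + \mathrm{E}_s^{\overline{\xi}_1, \pi_2}(v(\Theta_k) \mid \Theta_k \in S \setminus (T \cup W_2)) \cdot \Pr\nolimits_s^{\overline{\xi}_1, \pi_2}(\Theta_k \in S \setminus (T \cup W_2)) \\
& + \mathrm{E}_s^{\overline{\xi}_1, \pi_2}(v(\Theta_k) \mid \Theta_k \in W_2) \cdot \Pr\nolimits_s^{\overline{\xi}_1, \pi_2}(\Theta_k \in W_2).
\end{aligned}$$

Since $v(s) \le 1$ when $s \in T$, the first term on the right-hand side is at most $\Pr_s^{\overline{\xi}_1, \pi_2}(\Theta_k \in T)$. For the second
term, we have $\lim_{k \to \infty} \Pr_s^{\overline{\xi}_1, \pi_2}(\Theta_k \in S \setminus (T \cup W_2)) = 0$ by hypothesis, because $\Pr_s^{\overline{\xi}_1, \pi_2}(\mathrm{Reach}(T \cup W_2)) = 1$
and every state $s \in (T \cup W_2)$ is absorbing. Finally, the third term on the right-hand side is 0, as $v(s) = 0$
for all states $s \in W_2$. Hence, taking the limit with $k \to \infty$, we obtain

$$\Pr\nolimits_s^{\overline{\xi}_1, \pi_2}(\mathrm{Reach}(T)) = \lim_{k \to \infty} \Pr\nolimits_s^{\overline{\xi}_1, \pi_2}(\Theta_k \in T) \ge \lim_{k \to \infty} v_k \ge v,$$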

where the last inequality follows from $v_k \ge v$ for all $k \ge 0$. Note that, since $T$ is absorbing, the
sequence $\Pr_s^{\overline{\xi}_1, \pi_2}(\Theta_k \in T)$ is non-decreasing in $k$ and is bounded by 1 (since it is a probability);
hence its limit is defined. The desired result follows. ∎



4.2 From value iteration to optimal selectors 

In this section we show how to obtain memoryless $\varepsilon$-optimal strategies from the value-iteration scheme, for
$\varepsilon > 0$. In the following section the existence of such strategies will be established using a strategy-iteration
scheme. The strategy-iteration scheme has been used previously to establish the existence of memoryless
$\varepsilon$-optimal strategies, for $\varepsilon > 0$ (for example, see [13] and also the results of Condon [5] for turn-based games).
However, our proof, which constructs the memoryless strategies based on the value-iteration scheme, is new.
Considering again the value-iteration scheme (1), since $\langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T)) = \lim_{k \to \infty} u_k$, for every $\varepsilon > 0$
there is a $k$ such that $u_k(s) \ge u_{k-1}(s) \ge \langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))(s) - \varepsilon$ at all states $s \in S$. Lemma 3 indicates
that, in order to construct a memoryless $\varepsilon$-optimal strategy, we need to construct from $u_{k-1}$ a player-1
selector $\xi_1$ such that:

1. $\xi_1$ is value-optimal for $u_{k-1}$, that is, $\mathit{Pre}_{1:\xi_1}(u_{k-1}) = \mathit{Pre}_1(u_{k-1}) = u_k$; and

2. $\xi_1$ is proper.

To ensure the construction of a value-optimal, proper selector, we need some definitions. For $r > 0$, the
value class

$$U_r^k = \{ s \in S \mid u_k(s) = r \}$$

consists of the states with value $r$ under the valuation $u_k$. Similarly we define $U_{\bowtie r}^k = \{ s \in S \mid u_k(s) \bowtie r \}$,
for $\bowtie \in \{<, \le, \ge, >\}$. For a state $s \in S$, let $\ell_k(s) = \min\{ j \le k \mid u_j(s) = u_k(s) \}$ be the entry time of $s$ in
$U_{u_k(s)}^k$, that is, the least iteration $j$ in which the state $s$ has the same value as in iteration $k$. For $k \ge 0$, we
define the player-1 selector $\eta_k$ as follows: if $\ell_k(s) > 0$, then

$$\eta_k(s) = \eta_{\ell_k(s)}(s) = \arg\max_{\xi_1 \in \Lambda_1(s)} \inf_{\xi_2 \in \Lambda_2(s)} \mathit{Pre}_{\xi_1, \xi_2}(u_{\ell_k(s) - 1})(s);$$

otherwise, if $\ell_k(s) = 0$, then $\eta_k(s) = \eta_{\ell_k(s)}(s) = \xi_1^{unif}(s)$ (this definition is arbitrary, and it does not affect
the remainder of the proof). In words, the selector $\eta_k(s)$ is an optimal selector for $s$ at the iteration $\ell_k(s)$. It
follows easily that $u_k = \mathit{Pre}_{1:\eta_k}(u_{k-1})$, that is, $\eta_k$ is also value-optimal for $u_{k-1}$, satisfying the first of the
above conditions.
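
As a small illustration of value classes and entry times, the following sketch evaluates both notions on the valuations $u_0, \ldots, u_4$ listed in Example 1. The function names `entry_time` and `value_class` are hypothetical helpers introduced only for this example.

```python
from fractions import Fraction as F

H = F(1, 2)
# Valuations u_0..u_4 of Example 1, indexed as U[k][s] for states s0..s4.
U = [
    (1, 0, 0, 0, 0),
    (1, 0, H, 0, 0),
    (1, 0, H, H, 0),
    (1, 0, H, H, H),
    (1, 0, H, H, H),
]


def entry_time(U, k, s):
    """Least iteration j <= k at which state s already has its iteration-k value."""
    return min(j for j in range(k + 1) if U[j][s] == U[k][s])


def value_class(U, k, r):
    """States whose value under u_k equals r."""
    return {s for s in range(len(U[k])) if U[k][s] == r}


ENTRY = [entry_time(U, 4, s) for s in range(5)]
HALF_CLASS = value_class(U, 4, H)
```

Within the value class for $\tfrac12$, the entry times increase along $s_2, s_3, s_4$; the selector $\eta_k$ at $s_3$ is fixed at the entry time of $s_3$, when move $b$ is the only move that attains the new value.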

To conclude the construction, we need to prove that for $k$ sufficiently large (namely, for $k$ such that
$u_k(s) > 0$ at all states $s \in S \setminus (T \cup W_2)$), the selector $\eta_k$ is proper. To this end we use Theorem 1, and
show that for sufficiently large $k$ no end component of $G_{\eta_k}$ is entirely contained in $S \setminus (T \cup W_2)$.¹ To reason
about the end components of $G_{\eta_k}$, for a state $s \in S$ and a player-2 move $a_2 \in \Gamma_2(s)$, we write

$$\mathrm{Dest}_k(s, a_2) = \bigcup_{a_1 \in \mathrm{Supp}(\eta_k(s))} \mathrm{Dest}(s, a_1, a_2)$$

for the set of possible successors of state $s$ when player 1 follows the strategy $\overline{\eta}_k$, and player 2 chooses the
move $a_2$.

¹ In fact, the result holds for all $k$, even though our proof, for the sake of a simpler argument, does not show it.

Lemma 4 Let $0 < r \le 1$ and $k \ge 0$, and consider a state $s \in S \setminus (T \cup W_2)$ such that $s \in U_r^k$. For all
moves $a_2 \in \Gamma_2(s)$, we have:

1. either $\mathrm{Dest}_k(s, a_2) \cap U_{>r}^k \ne \emptyset$,

2. or $\mathrm{Dest}_k(s, a_2) \subseteq U_r^k$, and there is a state $t \in \mathrm{Dest}_k(s, a_2)$ with $\ell_k(t) < \ell_k(s)$.

Proof. For convenience, let $m = \ell_k(s)$, and consider any move $a_2 \in \Gamma_2(s)$.

• Consider first the case that $\mathrm{Dest}_k(s, a_2) \not\subseteq U_r^k$. Then, it cannot be that $\mathrm{Dest}_k(s, a_2) \subseteq U_{\le r}^k$; otherwise,
for all states $t \in \mathrm{Dest}_k(s, a_2)$, we would have $u_k(t) \le r$, and there would be at least one state
$t \in \mathrm{Dest}_k(s, a_2)$ such that $u_k(t) < r$, contradicting $u_k(s) = r$ and $\mathit{Pre}_{1:\eta_k}(u_{k-1}) = u_k$. So, it must
be that $\mathrm{Dest}_k(s, a_2) \cap U_{>r}^k \ne \emptyset$.

• Consider now the case that $\mathrm{Dest}_k(s, a_2) \subseteq U_r^k$. Since $u_m \le u_k$, due to the monotonicity of the $\mathit{Pre}_1$
operator and (1), we have that $u_{m-1}(t) \le r$ for all states $t \in \mathrm{Dest}_k(s, a_2)$. From $r = u_k(s) =
u_m(s) = \mathit{Pre}_{1:\eta_k}(u_{m-1})(s)$, it follows that $u_{m-1}(t) = r$ for all states $t \in \mathrm{Dest}_k(s, a_2)$, implying that
$\ell_k(t) < m$ for all states $t \in \mathrm{Dest}_k(s, a_2)$. ∎

The above lemma states that under $\eta_k$, from each state $s \in U_r^k$ with $r > 0$ we are guaranteed a probability
bounded away from 0 of either moving to a higher-value class $U_{>r}^k$, or of moving to states within the value
class that have a strictly lower entry time. Note that the states in the target set $T$ are all in $U_1^0$: they have
entry time 0 in the value class for value 1. This implies that every state in $S \setminus W_2$ has a probability bounded
above zero of reaching $T$ in at most $n = |S|$ steps, so that the probability of staying forever in $S \setminus (T \cup W_2)$
is 0. To prove this fact formally, we analyze the end components of $G_{\eta_k}$ in light of Lemma 4.

Lemma 5 For all $k \ge 0$, if for all states $s \in S \setminus W_2$ we have $u_{k-1}(s) > 0$, then for all player-2 strategies
$\pi_2$, we have $\Pr_s^{\overline{\eta}_k, \pi_2}(\mathrm{Reach}(T \cup W_2)) = 1$.

Proof. Since every state $s \in (T \cup W_2)$ is absorbing, to prove this result, in view of Corollary 1, it suffices
to show that no end component of $G_{\eta_k}$ is entirely contained in $S \setminus (T \cup W_2)$. Towards a contradiction,
assume there is such an end component $C \subseteq S \setminus (T \cup W_2)$. Then, we have $C \subseteq U_{[r_1, r_2]}^k$ with $C \cap U_{r_2}^k \ne \emptyset$,
for some $0 < r_1 \le r_2 \le 1$, where $U_{[r_1, r_2]}^k = U_{\ge r_1}^k \cap U_{\le r_2}^k$ is the union of the value classes for all values in
the interval $[r_1, r_2]$. Consider a state $s \in C \cap U_{r_2}^k$ with minimal $\ell_k$, that is, such that $\ell_k(s) \le \ell_k(t)$ for all other
states $t \in C \cap U_{r_2}^k$. From Lemma 4, it follows that for every move $a_2 \in \Gamma_2(s)$, there is a state $t \in \mathrm{Dest}_k(s, a_2)$
such that (i) either $t \in U_{r_2}^k$ and $\ell_k(t) < \ell_k(s)$, (ii) or $t \in U_{>r_2}^k$. In both cases, we obtain a contradiction. ∎

The above lemma shows that $\eta_k$ satisfies both requirements for optimal selectors spelt out at the beginning
of Section 4.2. Hence, $\eta_k$ guarantees the value $u_k$. This proves the existence of memoryless $\varepsilon$-optimal
strategies for concurrent reachability games.

Theorem 2 (Memoryless $\varepsilon$-optimal strategies). For every $\varepsilon > 0$, memoryless $\varepsilon$-optimal strategies
exist for all concurrent games with reachability objectives.

Proof. Consider a concurrent reachability game with target $T \subseteq S$. Since $\lim_{k \to \infty} u_k =
\langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))$, for every $\varepsilon > 0$ we can find $k \in \mathbb{N}$ such that the following two assertions hold:

$$\max_{s \in S} \bigl( \langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))(s) - u_{k-1}(s) \bigr) < \varepsilon, \qquad \min_{s \in S \setminus W_2} u_{k-1}(s) > 0.$$

By construction, $\mathit{Pre}_{1:\eta_k}(u_{k-1}) = \mathit{Pre}_1(u_{k-1}) = u_k$. Hence, from Lemma 3 and Lemma 5, for all player-2
strategies $\pi_2$, we have $\Pr^{\overline{\eta}_k, \pi_2}(\mathrm{Reach}(T)) \ge u_{k-1}$, leading to the result. ∎



5 Strategy Improvement Algorithm for Concurrent Reachability Games 

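In the previous section, we provided a proof of the existence of memoryless $\varepsilon$-optimal strategies for all
$\varepsilon > 0$, on the basis of a value-iteration scheme. In this section we present a strategy-improvement algorithm
for concurrent games with reachability objectives. The algorithm will produce a sequence of selectors
$\gamma_0, \gamma_1, \gamma_2, \ldots$ for player 1, such that:

1. for all $i \ge 0$, we have $\langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T)) \le \langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_{i+1}}(\mathrm{Reach}(T))$;

2. if there is $i \ge 0$ such that $\gamma_i = \gamma_{i+1}$, then $\langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T)) = \langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))$; and

3. $\lim_{i \to \infty} \langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T)) = \langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))$.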

Condition 1 guarantees that the algorithm computes a sequence of monotonically improving selectors.
Condition 2 guarantees that if a selector cannot be improved, then it is optimal. Condition 3 guarantees that the
value guaranteed by the selectors converges to the value of the game, or equivalently, that for all $\varepsilon > 0$,
there is a number $i$ of iterations such that the memoryless player-1 strategy $\overline{\gamma}_i$ is $\varepsilon$-optimal. Note that for
concurrent reachability games, there may be no $i \ge 0$ such that $\gamma_i = \gamma_{i+1}$, that is, the algorithm may fail
to generate an optimal selector. This is because there are concurrent reachability games that do not admit
optimal strategies, but only $\varepsilon$-optimal strategies for all $\varepsilon > 0$ [14, 10]. For turn-based reachability games,
our algorithm terminates with an optimal selector and we will present bounds for termination.

We note that the value-iteration scheme of the previous section does not directly yield a strategy-improvement
algorithm. In fact, the sequence of player-1 selectors $\eta_0, \eta_1, \eta_2, \ldots$ computed in Section 4.1
may violate Condition 2: it is possible that for some $i \ge 0$ we have $\eta_i = \eta_{i+1}$, but $\eta_i \ne \eta_j$ for some $j > i$.
This is because the scheme of Section 4.1 is fundamentally a value-iteration scheme, even though a selector
is extracted from each valuation. The scheme guarantees that the valuations $u_0, u_1, u_2, \ldots$ defined as in (1)
converge, but it does not guarantee that the selectors $\eta_0, \eta_1, \eta_2, \ldots$ improve at each iteration.

The strategy-improvement algorithm presented here shares an important connection with the proof of 
the existence of memoryless e-optimal strategies presented in the previous section. Here, also, the key is 
to ensure that all generated selectors are proper. Again, this is ensured by modifying the selectors, at each 
iteration, only where they can be improved. 

5.1 The strategy-improvement algorithm 

Ordering of strategies. We let $W_2$ be as in Section 4.1, and again we assume without loss of generality
that all states in $W_2 \cup T$ are absorbing. We define a preorder $\prec$ on the strategies for player 1 as
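Algorithm 1 Reachability Strategy-Improvement Algorithm

Input: a concurrent game structure $G$ with target set $T$.
Output: a strategy $\overline{\gamma}$ for player 1.

0. Compute $W_2 = \{ s \in S \mid \langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))(s) = 0 \}$.

1. Let $\gamma_0 = \xi_1^{unif}$ and $i = 0$.

2. Compute $v_0 = \langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_0}(\mathrm{Reach}(T))$.

3. do {

   3.1. Let $I = \{ s \in S \setminus (T \cup W_2) \mid \mathit{Pre}_1(v_i)(s) > v_i(s) \}$.

   3.2. Let $\xi_1$ be a player-1 selector such that for all states $s \in I$,
        we have $\mathit{Pre}_{1:\xi_1}(v_i)(s) = \mathit{Pre}_1(v_i)(s) > v_i(s)$.

   3.3. The player-1 selector $\gamma_{i+1}$ is defined as follows: for each state $s \in S$, let
        $\gamma_{i+1}(s) = \gamma_i(s)$ if $s \notin I$, and $\gamma_{i+1}(s) = \xi_1(s)$ if $s \in I$.

   3.4. Compute $v_{i+1} = \langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_{i+1}}(\mathrm{Reach}(T))$.

   3.5. Let $i = i + 1$.

   } until $I = \emptyset$.

4. return $\overline{\gamma}_i$.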



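follows: given two player-1 strategies $\pi_1$ and $\pi_1'$, let $\pi_1 \prec \pi_1'$ if the following two conditions hold:
(i) $\langle\langle 1\rangle\rangle_{val}^{\pi_1}(\mathrm{Reach}(T)) \le \langle\langle 1\rangle\rangle_{val}^{\pi_1'}(\mathrm{Reach}(T))$; and (ii) $\langle\langle 1\rangle\rangle_{val}^{\pi_1}(\mathrm{Reach}(T))(s) < \langle\langle 1\rangle\rangle_{val}^{\pi_1'}(\mathrm{Reach}(T))(s)$ for
some state $s \in S$. Furthermore, we write $\pi_1 \preceq \pi_1'$ if either $\pi_1 \prec \pi_1'$ or $\pi_1 = \pi_1'$.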

Informal description of Algorithm 1. We now present the strategy-improvement algorithm (Algorithm 1)
for computing the values for all states in $S \setminus (T \cup W_2)$. The algorithm iteratively improves player-1 strategies
according to the preorder $\prec$. The algorithm starts with the random selector $\gamma_0 = \xi_1^{unif}$. At iteration $i + 1$,
the algorithm considers the memoryless player-1 strategy $\overline{\gamma}_i$ and computes the value $\langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T))$.
Observe that since $\overline{\gamma}_i$ is a memoryless strategy, the computation of $\langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T))$ involves solving the
2-MDP $G_{\gamma_i}$. The valuation $\langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T))$ is named $v_i$. For all states $s$ such that $\mathit{Pre}_1(v_i)(s) > v_i(s)$,
the memoryless strategy at $s$ is modified to a selector that is value-optimal for $v_i$. The algorithm then
proceeds to the next iteration. If $\mathit{Pre}_1(v_i) = v_i$, the algorithm stops and returns the optimal memoryless
strategy $\overline{\gamma}_i$ for player 1. Unlike strategy-improvement algorithms for turn-based games (see [6] for a survey),
Algorithm 1 is not guaranteed to terminate, because the value of a reachability game may not be rational.
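
To make the control flow of Algorithm 1 concrete, the following sketch runs it on a 1-player MDP, where player 2 has no choices, so that evaluating a selector amounts to solving a Markov chain and a value-optimal selector at a state is simply a best move. This is a simplified special case, not the general concurrent algorithm (which requires solving a 2-MDP and a matrix game at each state). The MDP is a hypothetical variant of Example 1 with an extra move $c$ at $s_3$ leading to the sink $s_1$, and all names (`MDP`, `evaluate`, `improve`, `strategy_improvement`) are assumptions for the illustration.

```python
from fractions import Fraction as F

# HYPOTHETICAL 1-player MDP: MDP[s][move] = {successor: probability}.
MDP = {
    0: {"stay": {0: F(1)}},
    1: {"stay": {1: F(1)}},
    2: {"flip": {0: F(1, 2), 1: F(1, 2)}},
    3: {"a": {4: F(1)}, "b": {2: F(1)}, "c": {1: F(1)}},
    4: {"back": {3: F(1)}},
}
TARGET, W2 = {0}, {1}
INNER = [s for s in MDP if s not in TARGET | W2]


def evaluate(selector):
    """Reach probabilities under a proper selector: solve (I - P) v = b exactly."""
    idx = {s: i for i, s in enumerate(INNER)}
    n = len(INNER)
    A = [[F(int(i == j)) for j in range(n)] + [F(0)] for i in range(n)]
    for s in INNER:
        for move, q in selector[s].items():
            for t, p in MDP[s][move].items():
                if t in TARGET:
                    A[idx[s]][n] += q * p
                elif t not in W2:
                    A[idx[s]][idx[t]] -= q * p
    for c in range(n):  # Gauss-Jordan elimination; properness makes I - P invertible
        piv = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[piv] = A[piv], A[c]
        d = A[c][c]
        A[c] = [x / d for x in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                f = A[r][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    v = {s: F(1) for s in TARGET}
    v.update({s: F(0) for s in W2})
    v.update({s: A[idx[s]][n] for s in INNER})
    return v


def move_value(s, move, v):
    return sum(p * v[t] for t, p in MDP[s][move].items())


def improve(selector, v):
    """Steps 3.1-3.3: change the selector only where Pre_1(v)(s) > v(s)."""
    new, changed = dict(selector), set()
    for s in INNER:
        best = max(MDP[s], key=lambda m: move_value(s, m, v))
        if move_value(s, best, v) > v[s]:
            new[s] = {best: F(1)}
            changed.add(s)
    return new, changed


def strategy_improvement():
    selector = {s: {m: F(1, len(MDP[s])) for m in MDP[s]} for s in MDP}  # uniform start
    v = evaluate(selector)
    while True:
        selector, changed = improve(selector, v)
        if not changed:
            return selector, v
        v = evaluate(selector)


SELECTOR, VALUES = strategy_improvement()
```

In this run the uniform selector yields value $\tfrac14$ at $s_3$, the single improvement step switches $s_3$ to move $b$, and the loop then stops; the selector is never switched to the value-optimal but improper move $a$, because states are modified only where a strict improvement is possible.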

5.2 Convergence 

Lemma 6 Let $\gamma_i$ and $\gamma_{i+1}$ be the player-1 selectors obtained at iterations $i$ and $i + 1$ of Algorithm 1. If $\gamma_i$
is proper, then $\gamma_{i+1}$ is also proper.

Proof. Assume towards a contradiction that $\gamma_i$ is proper and $\gamma_{i+1}$ is not. Let $\xi_2$ be a pure selector for
player 2 to witness that $\gamma_{i+1}$ is not proper. Then there exists a subset $C \subseteq S \setminus (T \cup W_2)$ such that $C$ is a
closed recurrent set of states in the Markov chain $G_{\gamma_{i+1}, \xi_2}$. Let $I$ be the nonempty set of states where the
selector is modified to obtain $\gamma_{i+1}$ from $\gamma_i$; at all other states $\gamma_i$ and $\gamma_{i+1}$ agree.

Since $\gamma_i$ and $\gamma_{i+1}$ agree at all states other than the states in $I$, and $\gamma_i$ is a proper strategy, it follows
that $C \cap I \ne \emptyset$. Let $U_r^i = \{ s \in S \setminus (T \cup W_2) \mid \langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T))(s) = v_i(s) = r \}$ be the value class
with value $r$ at iteration $i$. For a state $s \in U_r^i$ the following assertion holds: if $\mathrm{Dest}(s, \gamma_i, \xi_2) \not\subseteq U_r^i$, then
$\mathrm{Dest}(s, \gamma_i, \xi_2) \cap U_{>r}^i \ne \emptyset$. Let $z = \max\{ r \mid U_r^i \cap C \ne \emptyset \}$, that is, $U_z^i$ is the greatest value class at iteration
$i$ with a nonempty intersection with the closed recurrent set $C$. It easily follows that $0 < z < 1$. Consider
any state $s \in I$, and let $s \in U_q^i$. Since $\mathit{Pre}_1(v_i)(s) > v_i(s)$, it follows that $\mathrm{Dest}(s, \gamma_{i+1}, \xi_2) \cap U_{>q}^i \ne \emptyset$.
Hence we must have $z > q$, and therefore $I \cap C \cap U_z^i = \emptyset$. Thus, for all states $s \in U_z^i \cap C$, we have
$\gamma_i(s) = \gamma_{i+1}(s)$. Recall that $U_z^i$ is the greatest value class at iteration $i$ with a nonempty intersection with $C$;
hence $U_{>z}^i \cap C = \emptyset$. Thus for all states $s \in C \cap U_z^i$, we have $\mathrm{Dest}(s, \gamma_{i+1}, \xi_2) \subseteq U_z^i \cap C$. It follows that
$C \subseteq U_z^i$. However, this gives us three statements that together form a contradiction: $C \cap I \ne \emptyset$ (or else $\gamma_i$
would not have been proper), $I \cap C \cap U_z^i = \emptyset$, and $C \subseteq U_z^i$. ∎

Lemma 7 For all $i \ge 0$, the player-1 selector $\gamma_i$ obtained at iteration $i$ of Algorithm 1 is proper.

Proof. By Lemma 2 we have that $\gamma_0$ is proper. The result then follows from Lemma 6 and induction. ∎

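Lemma 8 Let $\gamma_i$ and $\gamma_{i+1}$ be the player-1 selectors obtained at iterations $i$ and $i + 1$ of Algorithm 1.
Let $I = \{ s \in S \mid \mathit{Pre}_1(v_i)(s) > v_i(s) \}$. Let $v_i = \langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T))$ and $v_{i+1} = \langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_{i+1}}(\mathrm{Reach}(T))$.
Then $v_{i+1}(s) \ge \mathit{Pre}_1(v_i)(s)$ for all states $s \in S$; and therefore $v_{i+1}(s) \ge v_i(s)$ for all states $s \in S$, and
$v_{i+1}(s) > v_i(s)$ for all states $s \in I$.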

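Proof. Consider the valuations $v_i$ and $v_{i+1}$ obtained at iterations $i$ and $i + 1$, respectively, and let $w_i$ be the
valuation defined by $w_i(s) = 1 - v_i(s)$ for all states $s \in S$. Since $\gamma_{i+1}$ is proper (by Lemma 7), it follows
that the counter-optimal strategy for player 2 to minimize $v_{i+1}$ is obtained by maximizing the probability to
reach $W_2$. In fact, there are no end components in $S \setminus (W_2 \cup T)$ in the 2-MDP $G_{\gamma_{i+1}}$. Let

$$\hat{w}_i(s) = \begin{cases} w_i(s) & \text{if } s \in S \setminus I; \\ 1 - \mathit{Pre}_1(v_i)(s) < w_i(s) & \text{if } s \in I. \end{cases}$$

In other words, $\hat{w}_i = 1 - \mathit{Pre}_1(v_i)$, and we also have $\hat{w}_i \le w_i$. We now show that $\hat{w}_i$ is a feasible
solution to the linear program for MDPs with the objective $\mathrm{Reach}(W_2)$, as described in Section 3. Since
$v_i = \langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T))$, it follows that for all states $s \in S$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$w_i(s) \ge \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_i}(s, a_2)(t).$$

For all states $s \in S \setminus I$, we have $\gamma_i(s) = \gamma_{i+1}(s)$ and $\hat{w}_i(s) = w_i(s)$, and since $\hat{w}_i \le w_i$, it follows that for
all states $s \in S \setminus I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$\hat{w}_i(s) \ge \sum_{t \in S} \hat{w}_i(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t) \qquad (\text{for } s \in S \setminus I).$$

Since for $s \in I$ the selector $\gamma_{i+1}(s)$ is obtained as an optimal selector for $\mathit{Pre}_1(v_i)(s)$, it follows that
for all states $s \in I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$\mathit{Pre}_{\gamma_{i+1}, a_2}(v_i)(s) \ge \mathit{Pre}_1(v_i)(s);$$

in other words, $1 - \mathit{Pre}_1(v_i)(s) \ge 1 - \mathit{Pre}_{\gamma_{i+1}, a_2}(v_i)(s)$. Hence for all states $s \in I$ and all moves
$a_2 \in \Gamma_2(s)$, we have

$$\hat{w}_i(s) \ge \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t).$$

Since $\hat{w}_i \le w_i$, for all states $s \in I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$\hat{w}_i(s) \ge \sum_{t \in S} \hat{w}_i(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t) \qquad (\text{for } s \in I).$$

Hence it follows that $\hat{w}_i$ is a feasible solution to the linear program for MDPs with reachability objectives.
Since the reachability valuation for player 2 for $\mathrm{Reach}(W_2)$ is the least solution (observe that the objective
function of the linear program is a minimizing function), it follows that $v_{i+1} \ge 1 - \hat{w}_i = \mathit{Pre}_1(v_i)$. Thus
we obtain $v_{i+1}(s) \ge v_i(s)$ for all states $s \in S$, and $v_{i+1}(s) > v_i(s)$ for all states $s \in I$. ∎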

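Theorem 3 (Strategy improvement). The following two assertions hold about Algorithm 1:

1. For all $i \ge 0$, we have $\overline{\gamma}_i \preceq \overline{\gamma}_{i+1}$; moreover, if $\gamma_i = \gamma_{i+1}$, then $\overline{\gamma}_i$ is an optimal strategy.

2. $\lim_{i \to \infty} v_i = \lim_{i \to \infty} \langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T)) = \langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))$.

Proof. We prove the two parts as follows.

1. The assertion that $\overline{\gamma}_i \preceq \overline{\gamma}_{i+1}$ follows from Lemma 8. If $\gamma_i = \gamma_{i+1}$, then $\mathit{Pre}_1(v_i) = v_i$. Let
$v = \langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))$, and since $v$ is the least solution to satisfy $\mathit{Pre}_1(x) = x$ (i.e., the least
fixpoint) [11], it follows that $v_i \ge v$. From Lemma 7 it follows that $\gamma_i$ is proper. Since $\gamma_i$ is proper, by
Lemma 3, we have $\langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_i}(\mathrm{Reach}(T)) \ge v_i \ge v$. It follows that $\overline{\gamma}_i$ is optimal for player 1.

2. Let $v_0 = [T]$ and $u_0 = [T]$. We have $u_0 \le v_0$. For all $k \ge 0$, by Lemma 8, we have $v_{k+1} \ge
[T] \vee \mathit{Pre}_1(v_k)$. For all $k \ge 0$, let $u_{k+1} = [T] \vee \mathit{Pre}_1(u_k)$. By induction we conclude that for all
$k \ge 0$, we have $u_k \le v_k$. Moreover, $v_k \le \langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))$, that is, for all $k \ge 0$, we have

$$u_k \le v_k \le \langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T)).$$

Since $\lim_{k \to \infty} u_k = \langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))$, it follows that

$$\lim_{k \to \infty} \langle\langle 1\rangle\rangle_{val}^{\overline{\gamma}_k}(\mathrm{Reach}(T)) = \lim_{k \to \infty} v_k = \langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T)).$$

The theorem follows. ∎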

5.3 Termination for turn-based stochastic games 

If the input game structure to Algorithm 1 is a turn-based stochastic game structure, and we start with a
proper selector $\gamma_0$ that is pure, then for all $i \ge 0$ we can choose the selector $\gamma_i$ such that $\gamma_i$ is both proper and
pure. This claim follows since, given a valuation $v$, if a state $s$ is a player-1 state, then there is an action $a$
at $s$ (or choice of an edge at $s$) that achieves $\mathit{Pre}_1(v)(s)$ at $s$. Since the number of pure selectors is bounded,
if we start with a pure, proper selector then termination is ensured. Hence we present a procedure to compute
a pure, proper selector, and then present termination bounds (i.e., bounds on $i$ such that $u_{i+1} = u_i$). The
construction of a pure, proper selector is based on the notion of attractors defined below.

Attractor strategy. Let $A_0 = W_2 \cup T$, and for $i \ge 0$ we have

$$A_{i+1} = A_i \cup \{ s \in S_1 \cup S_R \mid E(s) \cap A_i \ne \emptyset \} \cup \{ s \in S_2 \mid E(s) \subseteq A_i \}.$$

Since for all $s \in S \setminus W_2$ we have $\langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))(s) > 0$, it follows that from all states in $S \setminus W_2$ player 1
can ensure that $T$ is reached with positive probability. It follows that for some $i \ge 0$ we have $A_i = S$. The
pure attractor selector $\xi^*$ is as follows: for a state $s \in (A_{i+1} \setminus A_i) \cap S_1$ we have $\xi^*(s)(t) = 1$, where $t \in A_i$ is a successor of $s$
(such a $t$ exists by construction). The pure memoryless strategy $\overline{\xi}^*$ ensures that for all $i \ge 0$, from $A_{i+1}$ the
game reaches $A_i$ with positive probability. Hence there is no end component $C$ contained in $S \setminus (W_2 \cup T)$
in the MDP $G_{\xi^*}$. It follows that $\xi^*$ is a pure selector that is proper, and the selector $\xi^*$ can be computed in
$O(|E|)$ time. We now present the termination bounds.
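
The attractor construction can be sketched as follows. This is an illustrative, straightforward implementation that recomputes each level by scanning all remaining states, so it does not achieve the linear-time bound mentioned above; the function name `attractor_selector` and the graph encoding are assumptions made for the example.

```python
def attractor_selector(s1, s2, sr, edges, a0):
    """Compute the levels A_0, A_1, ... and a pure selector for player-1 states.

    s1, s2, sr: sets of player-1, player-2 and probabilistic states.
    edges[s]: set of successors of s. a0: the initial set W_2 ∪ T.
    Returns (final attractor set, {player-1 state: chosen successor}).
    """
    attr = set(a0)
    choice = {}
    while True:
        new = set()
        for s in (s1 | s2 | sr) - attr:
            if s in s2:
                ok = edges[s] <= attr          # player 2 cannot avoid A_i
            else:
                ok = bool(edges[s] & attr)     # some edge leads into A_i
            if ok:
                new.add(s)
                if s in s1:
                    # any successor in A_i works; min() just makes the choice deterministic
                    choice[s] = min(edges[s] & attr)
        if not new:
            return attr, choice
        attr |= new                             # A_{i+1}


# Hypothetical game: 0 is the target, 4 is a W_2 sink, 1 is a player-1 state,
# 2 is a player-2 state, 3 is a probabilistic state.
EDGES = {0: {0}, 1: {2, 3}, 2: {0, 1}, 3: {0, 4}, 4: {4}}
ATTR, CHOICE = attractor_selector({1}, {2}, {3}, EDGES, {0, 4})
```

In this toy game state 3 enters the attractor first, then the player-1 state 1 (choosing the edge to 3), and finally the player-2 state 2, whose successors are by then all inside the attractor.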

Termination bounds. We present termination bounds for binary turn-based stochastic games. A turn-based
stochastic game is binary if for all $s \in S_R$ we have $|E(s)| \le 2$, and for all $s \in S_R$ if $|E(s)| = 2$, then for
all $t \in E(s)$ we have $\delta(s)(t) = \tfrac12$, i.e., for all probabilistic states there are at most two successors and the
transition function $\delta$ is uniform.

Lemma 9 Let $G$ be a binary Markov chain with $|S|$ states with a reachability objective $\mathrm{Reach}(T)$. Then
for all $s \in S$ we have $\langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))(s) = \frac{p}{q}$, with $p, q \in \mathbb{N}$ and $p, q \le 4^{|S| - 1}$.

Proof. The result follows as a special case of Lemma 2 of [6]. Lemma 2 of [6] holds for halting turn-based
stochastic games, and since Markov chains reach the set of closed connected recurrent states with
probability 1 from all states, the result follows. ∎

Lemma 10 Let $G$ be a binary turn-based stochastic game with a reachability objective $\mathrm{Reach}(T)$. Then
for all $s \in S$ we have $\langle\langle 1\rangle\rangle_{val}(\mathrm{Reach}(T))(s) = \frac{p}{q}$, with $p, q \in \mathbb{N}$ and $p, q \le 4^{|S_R| - 1}$.

Proof. Since pure memoryless optimal strategies exist for both players (existence of pure memoryless
optimal strategies for both players in turn-based stochastic reachability games follows from [5]), we fix pure
memoryless optimal strategies $\pi_1$ and $\pi_2$ for both players. The Markov chain $G_{\pi_1, \pi_2}$ can then be reduced
to an equivalent Markov chain with $|S_R|$ states (since we fix deterministic successors for states in $S_1 \cup S_2$,
they can be collapsed to their successors). The result then follows from Lemma 9. ∎

From Lemma 10 it follows that at iteration $i$ of the reachability strategy-improvement algorithm either
the sum of the values increases by at least $\frac{1}{4^{|S_R| - 1}}$, or else there is a valuation $u_i$ such that $u_{i+1} = u_i$. Since
the sum of values of all states can be at most $|S|$, it follows that the algorithm terminates in at most $|S| \cdot 4^{|S_R| - 1}$
iterations. Moreover, since the number of pure memoryless strategies is at most $\prod_{s \in S_1} |E(s)|$, the algorithm
terminates in at most $\prod_{s \in S_1} |E(s)|$ iterations. It follows from the results of [28] that a turn-based stochastic
game structure $G$ can be reduced to an equivalent binary turn-based stochastic game structure $G'$ such that
the sets of player-1 and player-2 states in $G$ and $G'$ are the same and the number of probabilistic states in $G'$
is $O(|\delta|)$, where $|\delta|$ is the size of the transition function in $G$. Thus we obtain the following result.

Theorem 4 Let $G$ be a turn-based stochastic game with a reachability objective $\mathrm{Reach}(T)$. Then the
reachability strategy-improvement algorithm computes the values in time

$$O\Bigl( \min\Bigl\{ \prod_{s \in S_1} |E(s)|,\ 2^{O(|\delta|)} \Bigr\} \cdot \mathit{poly}(|G|) \Bigr),$$

where $\mathit{poly}$ is a polynomial function.

The results of [16] presented an algorithm for turn-based stochastic games that works in time $O(|S_R|! \cdot
\mathit{poly}(|G|))$. The algorithm of [16] works only for turn-based stochastic games; for general turn-based
stochastic games the complexity of the algorithm of [16] is better. However, for turn-based stochastic games
where the transition function at all states can be expressed with constantly many bits we have $|\delta| = O(|S_R|)$.
In these cases the reachability strategy-improvement algorithm (that works for both concurrent and turn-based
stochastic games) works in time $2^{O(|S_R|)} \cdot \mathit{poly}(|G|)$ as compared to the time $2^{O(|S_R| \cdot \log(|S_R|))} \cdot \mathit{poly}(|G|)$
of the algorithm of [16].

6 Existence of Memoryless Optimal Strategies for Concurrent Safety 
Games 

A proof of the existence of memoryless optimal strategies for safety games can be found in [11]: the proof
uses results on martingales to obtain the result. For the sake of completeness we present an alternative proof
of the result: the proof we present is similar in spirit to the other proofs in this paper and uses the results
on MDPs to obtain the result. The proof is very similar to the proof presented in [13].

Theorem 5 (MEMORYLESS OPTIMAL STRATEGIES). Memoryless optimal strategies exist for all concur- 
rent games with safety objectives. 

Proof. Consider a concurrent game structure $G$ with a safety objective $\mathrm{Safe}(F)$ for player 1. Then it
follows from the results of [11] that

$$\langle\langle 1\rangle\rangle_{val}(\mathrm{Safe}(F)) = \nu X . (\min\{ [F], \mathit{Pre}_1(X) \}),$$

where $[F]$ is the indicator function of the set $F$ and $\nu$ denotes the greatest fixpoint. Let $T = S \setminus F$, and for
all states $s \in T$ we have $\langle\langle 1\rangle\rangle_{val}(\mathrm{Safe}(F))(s) = 0$, and hence any memoryless strategy from $T$ is an optimal
strategy. Thus without loss of generality we assume all states in $T$ are absorbing. Let $v = \langle\langle 1\rangle\rangle_{val}(\mathrm{Safe}(F))$,
and since we assume all states in $T$ are absorbing it follows that $\mathit{Pre}_1(v) = v$ (since $v$ is a fixpoint). Let $\gamma$
be a player-1 selector such that for all states $s$ we have $\mathit{Pre}_{1:\gamma}(v)(s) = \mathit{Pre}_1(v)(s) = v(s)$. We show that $\overline{\gamma}$
is a memoryless optimal strategy. Consider the player-2 MDP $G_\gamma$ and the maximal probability
for player 2 to reach the target set $T$. Consider the valuation $w$ defined as $w = 1 - v$. For all states $s \in T$
we have $w(s) = 1$. Since $\mathit{Pre}_{1:\gamma}(v) = \mathit{Pre}_1(v)$ it follows that for all states $s \in F$ and all $a_2 \in \Gamma_2(s)$ we
have

$$\mathit{Pre}_{\gamma, a_2}(v)(s) \ge \mathit{Pre}_1(v)(s) = v(s);$$

in other words, for all $s \in F$ we have $1 - \mathit{Pre}_1(v)(s) = 1 - v(s) \ge 1 - \mathit{Pre}_{\gamma, a_2}(v)(s)$. Hence for all states
$s \in F$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$w(s) \ge \sum_{t \in S} w(t) \cdot \delta_\gamma(s, a_2)(t).$$

Hence it follows that $w$ is a feasible solution to the linear program for MDPs with reachability objectives,
i.e., given the memoryless strategy $\overline{\gamma}$ for player 1 the maximal probability valuation for player 2 to reach $T$
is at most $w$. Hence the memoryless strategy $\overline{\gamma}$ ensures that the probability valuation for player 1 to stay
safe in $F$ against all player-2 strategies is at least $v = \langle\langle 1\rangle\rangle_{val}(\mathrm{Safe}(F))$. Optimality of $\overline{\gamma}$ follows. ∎

7 Strategy Improvement Algorithm for Concurrent Safety Games

In this section we present a strategy-improvement algorithm for concurrent games with safety objectives.
We consider a concurrent game structure with a safe set $F$, i.e., the objective for player 1 is $\mathrm{Safe}(F)$. The
algorithm will produce a sequence of selectors $\gamma_0, \gamma_1, \gamma_2, \ldots$ for player 1, such that Condition 1, Condition 2
and Condition 3 of Section 5 are satisfied. Note that for concurrent safety games, there may be no $i \ge 0$
such that $\gamma_i = \gamma_{i+1}$, that is, the algorithm may fail to generate an optimal selector, as the value can be
irrational [11]. We start with some notation.

Optimal selectors. Given a valuation $v$ and a state $s$, we define by

$$\mathrm{OptSel}(v, s) = \{ \xi_1 \in \Lambda_1(s) \mid \mathit{Pre}_{1:\xi_1}(v)(s) = \mathit{Pre}_1(v)(s) \}$$

the set of optimal selectors for $v$ at state $s$. For an optimal selector $\xi_1 \in \mathrm{OptSel}(v, s)$, we define the set of
counter-optimal actions as follows:

$$\mathrm{CountOpt}(v, s, \xi_1) = \{ b \in \Gamma_2(s) \mid \mathit{Pre}_{\xi_1, b}(v)(s) = \mathit{Pre}_1(v)(s) \}.$$

Observe that for $\xi_1 \in \mathrm{OptSel}(v, s)$, for all $b \in \Gamma_2(s) \setminus \mathrm{CountOpt}(v, s, \xi_1)$ we have $\mathit{Pre}_{\xi_1, b}(v)(s) >
\mathit{Pre}_1(v)(s)$. We define the set of optimal selector supports and counter-optimal action sets as follows:

$$\mathrm{OptSelCount}(v, s) = \{ (A, B) \subseteq \Gamma_1(s) \times \Gamma_2(s) \mid \exists \xi_1 \in \Lambda_1(s) .\ \xi_1 \in \mathrm{OptSel}(v, s) \wedge \mathrm{Supp}(\xi_1) = A \wedge \mathrm{CountOpt}(v, s, \xi_1) = B \};$$

i.e., it consists of pairs $(A, B)$ of actions of player 1 and player 2, such that there is an optimal selector $\xi_1$
with support $A$, and $B$ is the set of counter-optimal actions to $\xi_1$.

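Turn-based reduction. Given a concurrent game $G = \langle S, M, \Gamma_1, \Gamma_2, \delta \rangle$ and a valuation $v$ we construct a turn-based stochastic game $\overline{G}_v = \langle (\overline{S}, \overline{E}), (\overline{S}_1, \overline{S}_2, \overline{S}_R), \overline{\delta} \rangle$ as follows:

1. The set of states is as follows:

$$\overline{S} = S \cup \{(s, A, B) \mid s \in S, (A, B) \in \mathrm{OptSelCount}(v, s)\} \cup \{(s, A, b) \mid s \in S, (A, B) \in \mathrm{OptSelCount}(v, s), b \in B\}.$$

2. The state space partition is as follows: $\overline{S}_1 = S$; $\overline{S}_2 = \{(s, A, B) \mid s \in S, (A, B) \in \mathrm{OptSelCount}(v, s)\}$; and $\overline{S}_R = \{(s, A, b) \mid s \in S, (A, B) \in \mathrm{OptSelCount}(v, s), b \in B\}$. In other words, $(\overline{S}_1, \overline{S}_2, \overline{S}_R)$ is a partition of the state space, where $\overline{S}_1$ are player 1 states, $\overline{S}_2$ are player 2 states, and $\overline{S}_R$ are random or probabilistic states.

3. The set of edges is as follows:

$$\overline{E} = \{(s, (s, A, B)) \mid s \in S, (A, B) \in \mathrm{OptSelCount}(v, s)\} \cup \{((s, A, B), (s, A, b)) \mid b \in B\} \cup \{((s, A, b), t) \mid t \in \bigcup_{a \in A} \mathrm{Dest}(s, a, b)\}.$$

4. The transition function $\overline{\delta}$ for all states in $\overline{S}_R$ is uniform over its successors.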
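The four steps of the reduction can be sketched as a small graph construction. The Python code below is a minimal illustration, assuming that OptSelCount has already been computed and is passed in as a dictionary `opt_sel_count` mapping each state to a list of (A, B) pairs of frozensets, and that `dest(s, a, b)` returns the set Dest(s, a, b); the function names and this encoding are hypothetical and not taken from the paper.

```python
from fractions import Fraction


def turn_based_reduction(states, opt_sel_count, dest):
    """Build the state partition and the edge set of the turn-based game."""
    s1 = set(states)            # player-1 states: the original states
    s2, sr, edges = set(), set(), set()
    for s in states:
        for A, B in opt_sel_count[s]:
            s2.add((s, A, B))                 # player-2 state
            edges.add((s, (s, A, B)))
            for b in B:
                sr.add((s, A, b))             # probabilistic state
                edges.add(((s, A, B), (s, A, b)))
                for a in A:
                    for t in dest(s, a, b):
                        edges.add(((s, A, b), t))
    return s1, s2, sr, edges


def uniform_successors(node, edges):
    """Uniform distribution over the successors of a probabilistic state."""
    succ = [t for (u, t) in edges if u == node]
    return {t: Fraction(1, len(succ)) for t in succ}


# Hypothetical two-state example with one (A, B) pair per state.
A = frozenset({"a"})
B = frozenset({"c"})
opt_sel_count = {"s": [(A, B)], "t": [(A, B)]}


def dest(s, a, b):
    return {"s", "t"} if s == "s" else {"t"}


s1, s2, sr, edges = turn_based_reduction(["s", "t"], opt_sel_count, dest)
```

The probabilities of the original game are deliberately dropped in the probabilistic states: only the set of possible successors matters for almost-sure safety, which is the only question asked of the reduced game.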
Intuitively, the reduction is as follows. Given the valuation $v$, state $s$ is a player 1 state where player 1 can select a pair $(A, B)$ (and move to state $(s, A, B)$) with $A \subseteq \Gamma_1(s)$ and $B \subseteq \Gamma_2(s)$ such that there is an optimal selector $\xi_1$ with support exactly $A$ and the set of counter-optimal actions to $\xi_1$ is the set $B$. From a player 2 state $(s, A, B)$, player 2 can choose any action $b$ from the set $B$, and move to state $(s, A, b)$. A state $(s, A, b)$ is a probabilistic state where all the states in $\bigcup_{a \in A} \mathrm{Dest}(s, a, b)$ are chosen uniformly at random. Given a set $F \subseteq S$ we denote by $\overline{F} = F \cup \{(s, A, B) \in \overline{S} \mid s \in F\} \cup \{(s, A, b) \in \overline{S} \mid s \in F\}$. We refer to the above reduction as TB, i.e., $(\overline{G}_v, \overline{F}) = \mathrm{TB}(G, v, F)$.

Value-class of a valuation. Given a valuation $v$ and a real $0 \le r \le 1$, the value-class $U_r(v)$ of value $r$ is the set of states with valuation $r$, i.e., $U_r(v) = \{s \in S \mid v(s) = r\}$.
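As a small illustration of value-classes, the following Python sketch groups the states of a valuation by their value. The function name `value_classes` and the dictionary encoding of a valuation are assumptions made for this example only.

```python
from collections import defaultdict
from fractions import Fraction


def value_classes(v):
    """Map each value r to the value-class U_r(v) = {s | v(s) = r}."""
    classes = defaultdict(set)
    for s, r in v.items():
        classes[r].add(s)
    return dict(classes)


# Hypothetical valuation on four states.
v = {"s0": Fraction(1, 3), "s1": Fraction(1, 3), "s2": Fraction(1), "s3": Fraction(0)}
classes = value_classes(v)
```

Exact rationals are used so that equality of values is meaningful; with floating-point valuations the grouping would need a tolerance.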

7.1 The strategy-improvement algorithm 

Ordering of strategies. Let $G$ be a concurrent game and $F$ be the set of safe states. Let $T = S \setminus F$. Given a concurrent game structure $G$ with a safety objective $\mathrm{Safe}(F)$, the set of almost-sure winning states is the set of states $s$ such that the value at $s$ is 1, i.e., $W_1 = \{s \in S \mid \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) = 1\}$. An optimal strategy from $W_1$ is referred to as an almost-sure winning strategy. The set $W_1$ and an almost-sure winning strategy can be computed in linear time by the algorithm given in [9]. We assume without loss of generality that all states in $W_1 \cup T$ are absorbing. We recall the preorder $\prec$ on the strategies for player 1 (as defined in Section 5.1) as follows: given two player 1 strategies $\pi_1$ and $\pi_1'$, let $\pi_1 \prec \pi_1'$ if the following two conditions hold: (i) $\langle\langle 1 \rangle\rangle^{\pi_1}_{val}(\mathrm{Safe}(F)) \le \langle\langle 1 \rangle\rangle^{\pi_1'}_{val}(\mathrm{Safe}(F))$; and (ii) $\langle\langle 1 \rangle\rangle^{\pi_1}_{val}(\mathrm{Safe}(F))(s) < \langle\langle 1 \rangle\rangle^{\pi_1'}_{val}(\mathrm{Safe}(F))(s)$ for some state $s \in S$. Furthermore, we write $\pi_1 \preceq \pi_1'$ if either $\pi_1 \prec \pi_1'$ or $\pi_1 = \pi_1'$. We first present an example showing that improvements based only on $\mathit{Pre}_1$ operators are not sufficient for safety games, even on turn-based games, and then present our algorithm.

Example 2 Consider the turn-based stochastic game shown in Fig 2, where the □ states are player 1 states, the ◇ states are player 2 states, and ○ states are random states with probabilities labeled on edges. The safety goal is to avoid the state $s_4$. Consider a memoryless strategy $\pi_1$ for player 1 that chooses the successor $s_0 \to s_2$, and the counter-strategy $\pi_2$ for player 2 that chooses $s_1 \to s_0$. Given the strategies $\pi_1$ and $\pi_2$, the value at $s_0$, $s_1$ and $s_2$ is 1/3, and since all successors of $s_0$ have value 1/3, the value cannot be improved by $\mathit{Pre}_1$. However, note that if player 2 is restricted to choose only value-optimal selectors for the value 1/3, then player 1 can switch to the strategy $s_0 \to s_1$ and ensure that the game stays in the value class 1/3 with probability 1. Hence switching to $s_0 \to s_1$ would force player 2 to select a counter-strategy that switches to the strategy $s_1 \to s_3$, and thus player 1 can get a value 2/3. ∎

Informal description of Algorithm 2. We first present the basic strategy improvement algorithm (Algorithm 2) and will later present a convergent version (Algorithm 4) for computing the values for all states in $S \setminus W_1$. Algorithm 2 iteratively improves player-1 strategies according to the preorder $\prec$. The algorithm starts with the random selector $\gamma_0 = \xi_1^{unif}$ that plays at all states all actions uniformly at random. At iteration $i+1$, the algorithm considers the memoryless player-1 strategy $\overline{\gamma}_i$ and computes the value $\langle\langle 1 \rangle\rangle^{\overline{\gamma}_i}_{val}(\mathrm{Safe}(F))$. Observe that since $\overline{\gamma}_i$ is a memoryless strategy, the computation of $\langle\langle 1 \rangle\rangle^{\overline{\gamma}_i}_{val}(\mathrm{Safe}(F))$ involves solving the 2-MDP $G_{\gamma_i}$. The valuation $\langle\langle 1 \rangle\rangle^{\overline{\gamma}_i}_{val}(\mathrm{Safe}(F))$ is named $v_i$. For all states $s$ such that $\mathit{Pre}_1(v_i)(s) > v_i(s)$, the memoryless strategy at $s$ is modified to a selector that is value-optimal for $v_i$. The algorithm then proceeds to the next iteration. If $\mathit{Pre}_1(v_i) = v_i$, then the algorithm constructs the game $(\overline{G}_{v_i}, \overline{F}) = \mathrm{TB}(G, v_i, F)$, and computes $\overline{A}_i$ as the set of almost-sure winning states in $\overline{G}_{v_i}$ for the objective $\mathrm{Safe}(\overline{F})$. Let $U = (\overline{A}_i \cap S) \setminus W_1$. If $U$ is non-empty, then a selector $\gamma_{i+1}$ is obtained at $U$ from a pure memoryless optimal strategy (i.e., an almost-sure winning strategy) in $\overline{G}_{v_i}$, and the algorithm proceeds to iteration $i+1$. If $\mathit{Pre}_1(v_i) = v_i$ and $U$ is empty, then the algorithm stops and returns the memoryless strategy $\overline{\gamma}_i$ for player 1. Unlike strategy improvement algorithms for turn-based games (see [6] for a survey), Algorithm 2 is not guaranteed to terminate (see Example 3). We will show that Algorithm 2 has both the monotonicity and the optimality-on-termination properties; however, as we will illustrate in Example 3, the valuations of Algorithm 2 need not converge to the values. For turn-based stochastic games, Algorithm 2 does converge to the values. We will show that Algorithm 4 has all the desired properties (i.e., monotonicity, optimality on termination, and convergence to the values).

Figure 2: A turn-based stochastic safety game.
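Each iteration evaluates a fixed memoryless player-1 strategy, which amounts to solving a 2-MDP in which player 2 maximizes the probability of reaching the unsafe states. The Python sketch below approximates that valuation by value iteration; this is a swapped-in numerical technique chosen for brevity, not the exact computation the text relies on, and the function name, the dictionary encoding `delta_gamma[s][b][t]` and the iteration count are illustrative assumptions.

```python
def safety_value_fixed_selector(states, unsafe, moves2, delta_gamma, iterations=200):
    """Approximate the safety valuation of a fixed player-1 selector.

    delta_gamma[s][b] maps each successor t to its probability when
    player 1 follows the fixed selector and player 2 plays b at s.
    """
    # w approximates (from below) player 2's maximal probability to reach `unsafe`.
    w = {s: (1.0 if s in unsafe else 0.0) for s in states}
    for _ in range(iterations):
        w = {
            s: 1.0 if s in unsafe else max(
                sum(p * w[t] for t, p in delta_gamma[s][b].items())
                for b in moves2[s]
            )
            for s in states
        }
    # The safety valuation for player 1 is the complement.
    return {s: 1.0 - w[s] for s in states}


# Hypothetical 2-MDP: from s0 the play stays, becomes safe forever, or becomes unsafe.
states = ["s0", "s1", "s2"]
unsafe = {"s2"}
moves2 = {"s0": ["c"], "s1": ["c"], "s2": ["c"]}
delta_gamma = {
    "s0": {"c": {"s0": 0.5, "s1": 0.25, "s2": 0.25}},
    "s1": {"c": {"s1": 1.0}},
    "s2": {"c": {"s2": 1.0}},
}
v = safety_value_fixed_selector(states, unsafe, moves2, delta_gamma)
```

Because the iteration approaches the reachability probability from below, the returned safety values are upper approximations; an exact evaluation would instead solve the corresponding linear program.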

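Lemma 11 Let $\gamma_i$ and $\gamma_{i+1}$ be the player-1 selectors obtained at iterations $i$ and $i+1$ of Algorithm 2. Let $I = \{s \in S \setminus (W_1 \cup T) \mid \mathit{Pre}_1(v_i)(s) > v_i(s)\}$. Let $v_i = \langle\langle 1 \rangle\rangle^{\overline{\gamma}_i}_{val}(\mathrm{Safe}(F))$ and $v_{i+1} = \langle\langle 1 \rangle\rangle^{\overline{\gamma}_{i+1}}_{val}(\mathrm{Safe}(F))$. Then $v_{i+1}(s) \ge \mathit{Pre}_1(v_i)(s)$ for all states $s \in S$; and therefore $v_{i+1}(s) \ge v_i(s)$ for all states $s \in S$, and $v_{i+1}(s) > v_i(s)$ for all states $s \in I$.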

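Proof. The proof is essentially similar to the proof of Lemma 8, and we present the details for completeness. Consider the valuations $v_i$ and $v_{i+1}$ obtained at iterations $i$ and $i+1$, respectively, and let $w_i$ be the valuation defined by $w_i(s) = 1 - v_i(s)$ for all states $s \in S$. The counter-optimal strategy for player 2 to minimize $v_{i+1}$ is obtained by maximizing the probability to reach $T$. Let

$$\widehat{w}_i(s) = \begin{cases} w_i(s) & \text{if } s \in S \setminus I; \\ 1 - \mathit{Pre}_1(v_i)(s) < w_i(s) & \text{if } s \in I. \end{cases}$$

In other words, $\widehat{w}_i = 1 - \mathit{Pre}_1(v_i)$, and we also have $\widehat{w}_i \le w_i$. We now show that $\widehat{w}_i$ is a feasible solution to the linear program for MDPs with the objective $\mathrm{Reach}(T)$, as described in Section 3. Since $v_i = \langle\langle 1 \rangle\rangle^{\overline{\gamma}_i}_{val}(\mathrm{Safe}(F))$, it follows that for all states $s \in S$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$w_i(s) \ge \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_i}(s, a_2)(t).$$

For all states $s \in S \setminus I$, we have $\gamma_i(s) = \gamma_{i+1}(s)$ and $\widehat{w}_i(s) = w_i(s)$, and since $\widehat{w}_i \le w_i$, it follows that for all states $s \in S \setminus I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$\widehat{w}_i(s) \ge \sum_{t \in S} \widehat{w}_i(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t) \qquad (\text{for } s \in S \setminus I).$$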
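Algorithm 2 Safety Strategy-Improvement Algorithm

Input: a concurrent game structure $G$ with safe set $F$.
Output: a strategy $\overline{\gamma}$ for player 1.

0. Compute $W_1 = \{s \in S \mid \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) = 1\}$.
1. Let $\gamma_0 = \xi_1^{unif}$ and $i = 0$.
2. Compute $v_0 = \langle\langle 1 \rangle\rangle^{\overline{\gamma}_0}_{val}(\mathrm{Safe}(F))$.
3. do {
   3.1. Let $I = \{s \in S \setminus (W_1 \cup T) \mid \mathit{Pre}_1(v_i)(s) > v_i(s)\}$.
   3.2. if $I \neq \emptyset$, then
      3.2.1. Let $\xi_1$ be a player-1 selector such that for all states $s \in I$, we have $\mathit{Pre}_{1:\xi_1}(v_i)(s) = \mathit{Pre}_1(v_i)(s) > v_i(s)$.
      3.2.2. The player-1 selector $\gamma_{i+1}$ is defined as follows: for each state $s \in S$, let
             $\gamma_{i+1}(s) = \gamma_i(s)$ if $s \notin I$;
             $\gamma_{i+1}(s) = \xi_1(s)$ if $s \in I$.
   3.3. else
      3.3.1. let $(\overline{G}_{v_i}, \overline{F}) = \mathrm{TB}(G, v_i, F)$
      3.3.2. let $\overline{A}_i$ be the set of almost-sure winning states in $\overline{G}_{v_i}$ for $\mathrm{Safe}(\overline{F})$ and $\overline{\pi}_1$ be a pure memoryless almost-sure winning strategy from the set $\overline{A}_i$.
      3.3.3. if $((\overline{A}_i \cap S) \setminus W_1 \neq \emptyset)$
         3.3.3.1. let $U = (\overline{A}_i \cap S) \setminus W_1$
         3.3.3.2. The player-1 selector $\gamma_{i+1}$ is defined as follows: for $s \in S$, let
                  $\gamma_{i+1}(s) = \gamma_i(s)$ if $s \notin U$;
                  $\gamma_{i+1}(s) = \xi_1(s)$ if $s \in U$, $\xi_1(s) \in \mathrm{OptSel}(v_i, s)$, $\mathrm{Supp}(\xi_1(s)) = A$, $\overline{\pi}_1(s) = (s, A, B)$, $B = \mathrm{CountOpt}(v_i, s, \xi_1(s))$.
   3.4. Compute $v_{i+1} = \langle\langle 1 \rangle\rangle^{\overline{\gamma}_{i+1}}_{val}(\mathrm{Safe}(F))$.
   3.5. Let $i = i + 1$.
} until $I = \emptyset$ and $(\overline{A}_{i-1} \cap S) \setminus W_1 = \emptyset$.
4. return $\overline{\gamma}_i$.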
Since for $s \in I$ the selector $\gamma_{i+1}(s)$ is obtained as an optimal selector for $\mathit{Pre}_1(v_i)(s)$, it follows that for all states $s \in I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$\mathit{Pre}_{\gamma_{i+1}, a_2}(v_i)(s) \ge \mathit{Pre}_1(v_i)(s);$$

in other words, $1 - \mathit{Pre}_1(v_i)(s) \ge 1 - \mathit{Pre}_{\gamma_{i+1}, a_2}(v_i)(s)$. Hence for all states $s \in I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$\widehat{w}_i(s) \ge \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t).$$

Since $\widehat{w}_i \le w_i$, for all states $s \in I$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$\widehat{w}_i(s) \ge \sum_{t \in S} \widehat{w}_i(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t) \qquad (\text{for } s \in I).$$

Hence it follows that $\widehat{w}_i$ is a feasible solution to the linear program for MDPs with reachability objectives. Since the reachability valuation for player 2 for $\mathrm{Reach}(T)$ is the least solution (observe that the objective function of the linear program is a minimizing function), it follows that $v_{i+1} \ge 1 - \widehat{w}_i = \mathit{Pre}_1(v_i)$. Thus we obtain $v_{i+1}(s) \ge v_i(s)$ for all states $s \in S$, and $v_{i+1}(s) > v_i(s)$ for all states $s \in I$. ∎
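The feasibility argument in this proof can be checked mechanically on small instances. The Python sketch below tests whether a candidate valuation satisfies the inequalities of a reachability linear program for a 2-MDP; the exact form of the program (value 1 on the target, and one inequality per state and player-2 move) is an assumption about Section 3, which is not reproduced here, and the names `is_feasible` and `delta_gamma` are hypothetical.

```python
from fractions import Fraction


def is_feasible(w, target, moves2, delta_gamma):
    """Check w(s) >= sum_t w(t) * delta(s, a2)(t) for all s and a2, and w = 1 on target."""
    if any(w[s] != 1 for s in target):
        return False
    return all(
        w[s] >= sum(p * w[t] for t, p in delta_gamma[s][a2].items())
        for s in w
        for a2 in moves2[s]
    )


# Hypothetical 2-MDP: from s0 stay (1/2), reach safe s1 (1/4) or target s2 (1/4).
q, h = Fraction(1, 4), Fraction(1, 2)
moves2 = {"s0": ["c"], "s1": ["c"], "s2": ["c"]}
delta_gamma = {
    "s0": {"c": {"s0": h, "s1": q, "s2": q}},
    "s1": {"c": {"s1": Fraction(1)}},
    "s2": {"c": {"s2": Fraction(1)}},
}
target = {"s2"}
least = {"s0": h, "s1": Fraction(0), "s2": Fraction(1)}      # the true reachability values
slack = {"s0": Fraction(3, 4), "s1": Fraction(0), "s2": Fraction(1)}
too_small = {"s0": q, "s1": Fraction(0), "s2": Fraction(1)}
```

Any feasible valuation bounds the reachability probability from above, which is why exhibiting one feasible point suffices to bound the safety valuation of the next selector from below.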

Recall that by Example 2 it follows that improvement by step 3.2 alone is not sufficient to guarantee convergence to optimal values. We now present a lemma about the turn-based reduction, and then show that step 3.3 also leads to an improvement. Finally, in Theorem 7 we show that if improvements by step 3.2 and step 3.3 are not possible, then the optimal value and an optimal strategy are obtained.

Lemma 12 Let $G$ be a concurrent game with a set $F$ of safe states. Let $v$ be a valuation and consider $(\overline{G}_v, \overline{F}) = \mathrm{TB}(G, v, F)$. Let $\overline{A}$ be the set of almost-sure winning states in $\overline{G}_v$ for the objective $\mathrm{Safe}(\overline{F})$, and let $\overline{\pi}_1$ be a pure memoryless almost-sure winning strategy from $\overline{A}$ in $\overline{G}_v$. Consider a memoryless strategy $\pi_1$ in $G$ for states in $\overline{A} \cap S$ as follows: if $\overline{\pi}_1(s) = (s, A, B)$, then $\pi_1(s) \in \mathrm{OptSel}(v, s)$ such that $\mathrm{Supp}(\pi_1(s)) = A$ and $\mathrm{CountOpt}(v, s, \pi_1(s)) = B$. Consider a pure memoryless strategy $\pi_2$ for player 2. If for all states $s \in \overline{A} \cap S$, we have $\pi_2(s) \in \mathrm{CountOpt}(v, s, \pi_1(s))$, then for all $s \in \overline{A} \cap S$, we have $\Pr_s^{\pi_1, \pi_2}(\mathrm{Safe}(F)) = 1$.

Proof. We analyze the Markov chain arising after the players fix the memoryless strategies $\pi_1$ and $\pi_2$. Given the strategy $\pi_2$ consider the strategy $\overline{\pi}_2$ as follows: if $\overline{\pi}_1(s) = (s, A, B)$ and $\pi_2(s) = b \in \mathrm{CountOpt}(v, s, \pi_1(s))$, then at state $(s, A, B)$ choose the successor $(s, A, b)$. Since $\overline{\pi}_1$ is an almost-sure winning strategy for $\mathrm{Safe}(\overline{F})$, it follows that in the Markov chain obtained by fixing $\overline{\pi}_1$ and $\overline{\pi}_2$ in $\overline{G}_v$, all closed connected recurrent sets of states that intersect with $\overline{A}$ are contained in $\overline{A}$, and from all states of $\overline{A}$ the closed connected recurrent sets of states within $\overline{A}$ are reached with probability 1. It follows that in the Markov chain obtained from fixing $\pi_1$ and $\pi_2$ in $G$ all closed connected recurrent sets of states that intersect with $\overline{A} \cap S$ are contained in $\overline{A} \cap S$, and from all states of $\overline{A} \cap S$ the closed connected recurrent sets of states within $\overline{A} \cap S$ are reached with probability 1. The desired result follows. ∎

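Lemma 13 Let $\gamma_i$ and $\gamma_{i+1}$ be the player-1 selectors obtained at iterations $i$ and $i+1$ of Algorithm 2. Let $I = \{s \in S \setminus (W_1 \cup T) \mid \mathit{Pre}_1(v_i)(s) > v_i(s)\} = \emptyset$, and $(\overline{A}_i \cap S) \setminus W_1 \neq \emptyset$. Let $v_i = \langle\langle 1 \rangle\rangle^{\overline{\gamma}_i}_{val}(\mathrm{Safe}(F))$ and $v_{i+1} = \langle\langle 1 \rangle\rangle^{\overline{\gamma}_{i+1}}_{val}(\mathrm{Safe}(F))$. Then $v_{i+1}(s) \ge v_i(s)$ for all states $s \in S$, and $v_{i+1}(s) > v_i(s)$ for some state $s \in (\overline{A}_i \cap S) \setminus W_1$.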

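Proof. We first show that $v_{i+1} \ge v_i$. Let $U = (\overline{A}_i \cap S) \setminus W_1$. Let $w_i(s) = 1 - v_i(s)$ for all states $s \in S$. Since $v_i = \langle\langle 1 \rangle\rangle^{\overline{\gamma}_i}_{val}(\mathrm{Safe}(F))$, it follows that for all states $s \in S$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$w_i(s) \ge \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_i}(s, a_2)(t).$$

The selector $\xi_1(s)$ chosen for $\gamma_{i+1}$ at $s \in U$ satisfies $\xi_1(s) \in \mathrm{OptSel}(v_i, s)$. It follows that for all states $s \in S$ and all moves $a_2 \in \Gamma_2(s)$, we have

$$w_i(s) \ge \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s, a_2)(t).$$

It follows that the maximal probability with which player 2 can reach $T$ against the strategy $\overline{\gamma}_{i+1}$ is at most $w_i$. It follows that $v_i(s) \le v_{i+1}(s)$.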








We now argue that for some state $s \in U$ we have $v_{i+1}(s) > v_i(s)$. Given the strategy $\overline{\gamma}_{i+1}$, consider a pure memoryless counter-optimal strategy $\pi_2$ for player 2 to reach $T$. Since the selectors $\gamma_{i+1}(s)$ at states $s \in U$ are obtained from the almost-sure winning strategy $\overline{\pi}_1$ in the turn-based game $\overline{G}_{v_i}$ to satisfy $\mathrm{Safe}(\overline{F})$, it follows from Lemma 12 that if for every state $s \in U$, the action $\pi_2(s) \in \mathrm{CountOpt}(v_i, s, \gamma_{i+1})$, then from all states $s \in U$, the game stays safe in $F$ with probability 1. Since $\overline{\gamma}_{i+1}$ is a given strategy for player 1, and $\pi_2$ is counter-optimal against $\overline{\gamma}_{i+1}$, this would imply that $U \subseteq \{s \in S \mid \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) = 1\}$. This would contradict that $W_1 = \{s \in S \mid \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) = 1\}$ and $U \cap W_1 = \emptyset$. It follows that for some state $s^* \in U$ we have $\pi_2(s^*) \notin \mathrm{CountOpt}(v_i, s^*, \gamma_{i+1})$, and since $\gamma_{i+1}(s^*) \in \mathrm{OptSel}(v_i, s^*)$ we have

$$v_i(s^*) < \sum_{t \in S} v_i(t) \cdot \delta_{\gamma_{i+1}}(s^*, \pi_2(s^*))(t);$$

in other words, we have

$$w_i(s^*) > \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s^*, \pi_2(s^*))(t).$$

Define a valuation $z$ as follows: $z(s) = w_i(s)$ for $s \neq s^*$, and $z(s^*) = \sum_{t \in S} w_i(t) \cdot \delta_{\gamma_{i+1}}(s^*, \pi_2(s^*))(t)$. Given the strategy $\overline{\gamma}_{i+1}$ and the counter-optimal strategy $\pi_2$, the valuation $z$ satisfies the inequalities of the linear program for reachability to $T$. It follows that the probability to reach $T$ given $\overline{\gamma}_{i+1}$ is at most $z$. Thus we obtain that $v_{i+1}(s) \ge v_i(s)$ for all $s \in S$, and $v_{i+1}(s^*) > v_i(s^*)$. This concludes the proof. ∎

From Lemma 11 and Lemma 13 we obtain the following theorem, which shows that the sequence of values we obtain is monotonically non-decreasing.

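Theorem 6 (Monotonicity of values). For $i \ge 0$, let $\gamma_i$ and $\gamma_{i+1}$ be the player-1 selectors obtained at iterations $i$ and $i+1$ of Algorithm 2. If $\gamma_i \neq \gamma_{i+1}$, then (a) for all $s \in S$ we have $\langle\langle 1 \rangle\rangle^{\overline{\gamma}_i}_{val}(\mathrm{Safe}(F))(s) \le \langle\langle 1 \rangle\rangle^{\overline{\gamma}_{i+1}}_{val}(\mathrm{Safe}(F))(s)$; and (b) for some $s^* \in S$ we have $\langle\langle 1 \rangle\rangle^{\overline{\gamma}_i}_{val}(\mathrm{Safe}(F))(s^*) < \langle\langle 1 \rangle\rangle^{\overline{\gamma}_{i+1}}_{val}(\mathrm{Safe}(F))(s^*)$.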

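Theorem 7 (Optimality on termination). Let $v_i$ be the valuation at iteration $i$ of Algorithm 2 such that $v_i = \langle\langle 1 \rangle\rangle^{\overline{\gamma}_i}_{val}(\mathrm{Safe}(F))$. If $I = \{s \in S \setminus (W_1 \cup T) \mid \mathit{Pre}_1(v_i)(s) > v_i(s)\} = \emptyset$, and $(\overline{A}_i \cap S) \setminus W_1 = \emptyset$, then $\overline{\gamma}_i$ is an optimal strategy and $v_i = \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))$.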

Proof. We show that for all memoryless strategies $\pi_1$ for player 1 we have $\langle\langle 1 \rangle\rangle^{\pi_1}_{val}(\mathrm{Safe}(F)) \le v_i$. Since memoryless optimal strategies exist for concurrent games with safety objectives (Theorem 5) the desired result follows.

Let $\overline{\pi}_2$ be a pure memoryless optimal strategy for player 2 in $\overline{G}_{v_i}$ for the objective complementary to $\mathrm{Safe}(\overline{F})$, where $(\overline{G}_{v_i}, \overline{F}) = \mathrm{TB}(G, v_i, F)$. Consider a memoryless strategy $\pi_1$ for player 1, and define a pure memoryless strategy $\pi_2$ for player 2 as follows.

1. If $\pi_1(s) \notin \mathrm{OptSel}(v_i, s)$, then $\pi_2(s) = b \in \Gamma_2(s)$, such that $\mathit{Pre}_{\pi_1(s), b}(v_i)(s) < v_i(s)$ (such a $b$ exists since $\pi_1(s) \notin \mathrm{OptSel}(v_i, s)$).

2. If $\pi_1(s) \in \mathrm{OptSel}(v_i, s)$, then let $A = \mathrm{Supp}(\pi_1(s))$, and consider $B$ such that $B = \mathrm{CountOpt}(v_i, s, \pi_1(s))$. Then we have $\pi_2(s) = b$, such that $\overline{\pi}_2((s, A, B)) = (s, A, b)$.

Observe that by construction of $\pi_2$, for all $s \in S \setminus (W_1 \cup T)$, we have $\mathit{Pre}_{\pi_1(s), \pi_2(s)}(v_i)(s) \le v_i(s)$. We first show that in the Markov chain obtained by fixing $\pi_1$ and $\pi_2$ in $G$, there is no closed connected recurrent set of states $C$ such that $C \subseteq S \setminus (W_1 \cup T)$. Assume towards contradiction that $C$ is a closed connected recurrent set of states in $S \setminus (W_1 \cup T)$. The following case analysis achieves the contradiction.
1. Suppose for every state $s \in C$ we have $\pi_1(s) \in \mathrm{OptSel}(v_i, s)$. Then consider the strategy $\overline{\pi}_1$ in $\overline{G}_{v_i}$ such that for a state $s \in C$ we have $\overline{\pi}_1(s) = (s, A, B)$, where $\mathrm{Supp}(\pi_1(s)) = A$, and $B = \mathrm{CountOpt}(v_i, s, \pi_1(s))$. Since $C$ is a closed connected recurrent set of states, it follows by construction that for all states $s \in C$ in the game $\overline{G}_{v_i}$ we have $\Pr_s^{\overline{\pi}_1, \overline{\pi}_2}(\mathrm{Safe}(\overline{C})) = 1$, where $\overline{C} = C \cup \{(s, A, B) \mid s \in C\} \cup \{(s, A, b) \mid s \in C\}$. It follows that for all $s \in C$ in $\overline{G}_{v_i}$ we have $\Pr_s^{\overline{\pi}_1, \overline{\pi}_2}(\mathrm{Safe}(\overline{F})) = 1$. Since $\overline{\pi}_2$ is an optimal strategy, it follows that $C \subseteq (\overline{A}_i \cap S) \setminus W_1$. This contradicts that $(\overline{A}_i \cap S) \setminus W_1 = \emptyset$.

2. Otherwise for some state $s^* \in C$ we have $\pi_1(s^*) \notin \mathrm{OptSel}(v_i, s^*)$. Let $r = \min\{q \mid U_q(v_i) \cap C \neq \emptyset\}$, i.e., $r$ is the least value-class with non-empty intersection with $C$. Hence it follows that for all $q < r$, we have $U_q(v_i) \cap C = \emptyset$. Observe that since for all $s \in C$ we have $\mathit{Pre}_{\pi_1(s), \pi_2(s)}(v_i)(s) \le v_i(s)$, it follows that for all $s \in U_r(v_i)$ either (a) $\mathrm{Dest}(s, \pi_1(s), \pi_2(s)) \subseteq U_r(v_i)$; or (b) $\mathrm{Dest}(s, \pi_1(s), \pi_2(s)) \cap U_q(v_i) \neq \emptyset$, for some $q < r$. Since $U_r(v_i)$ is the least value-class with non-empty intersection with $C$, it follows that for all $s \in U_r(v_i)$ we have $\mathrm{Dest}(s, \pi_1(s), \pi_2(s)) \subseteq U_r(v_i)$. It follows that $C \subseteq U_r(v_i)$. Consider the state $s^* \in C$ such that $\pi_1(s^*) \notin \mathrm{OptSel}(v_i, s^*)$. By the construction of $\pi_2(s^*)$, we have $\mathit{Pre}_{\pi_1(s^*), \pi_2(s^*)}(v_i)(s^*) < v_i(s^*)$. Hence we must have $\mathrm{Dest}(s^*, \pi_1(s^*), \pi_2(s^*)) \cap U_q(v_i) \neq \emptyset$, for some $q < r$. Thus we have a contradiction.

It follows from above that there is no closed connected recurrent set of states in $S \setminus (W_1 \cup T)$, and hence with probability 1 the game reaches $W_1 \cup T$ from all states in $S \setminus (W_1 \cup T)$. Hence the probability to satisfy $\mathrm{Safe}(F)$ is equal to the probability to reach $W_1$. Since for all states $s \in S \setminus (W_1 \cup T)$ we have $\mathit{Pre}_{\pi_1(s), \pi_2(s)}(v_i)(s) \le v_i(s)$, it follows that given the strategies $\pi_1$ and $\pi_2$, the valuation $v_i$ satisfies all the inequalities of the linear program to reach $W_1$. It follows that the probability to reach $W_1$ from $s$ is at most $v_i(s)$. It follows that for all $s \in S \setminus (W_1 \cup T)$ we have $\langle\langle 1 \rangle\rangle^{\pi_1}_{val}(\mathrm{Safe}(F))(s) \le v_i(s)$. The result follows. ∎

k-uniform selectors and strategies. For concurrent games, we will use the result that for $\epsilon > 0$, there is a k-uniform memoryless strategy that achieves the value of a safety objective within $\epsilon$. We first define k-uniform selectors and k-uniform memoryless strategies. For a positive integer $k > 0$, a selector $\xi$ for player 1 is k-uniform if for all $s \in S \setminus (T \cup W_1)$ and all $a \in \mathrm{Supp}(\xi(s))$ there exist $i, j \in \mathbb{N}$ such that $0 < i \le j \le k$ and $\xi(s)(a) = \frac{i}{j}$, i.e., the moves in the support are played with probabilities that are multiples of $\frac{1}{\ell}$ with $\ell \le k$. We denote by $\Lambda^k$ the set of k-uniform selectors. A memoryless strategy is k-uniform if it is obtained from a k-uniform selector. We denote by $\Pi_1^{M,k}$ the set of k-uniform memoryless strategies for player 1. We first present a technical lemma (Lemma 14) that will be used in the key lemma (Lemma 15) to prove the convergence result.
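The k-uniformity condition is easy to test for a single distribution over moves. The Python sketch below is an illustration only: it reads the definition as "every positive probability equals i/j with 0 < i <= j <= k", which is equivalent to the reduced denominator being at most k; the function name `is_k_uniform` and the dictionary encoding are assumptions of this example.

```python
from fractions import Fraction


def is_k_uniform(dist, k):
    """Check that every move in the support has probability i/j with 0 < i <= j <= k."""
    if sum(dist.values()) != 1:
        return False  # not a probability distribution
    return all(
        p == 0 or (0 < p <= 1 and p.denominator <= k)
        for p in dist.values()
    )


uniform = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
skewed = {"a": Fraction(1, 3), "b": Fraction(2, 3)}
```

`Fraction` stores values in lowest terms, so `p.denominator` is the smallest j for which p = i/j, and the test `p.denominator <= k` captures the existence of suitable i and j.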

Lemma 14 Let $a_1, a_2, \ldots, a_m$ be $m$ real numbers such that (1) for all $1 \le i \le m$, we have $a_i > 0$; and (2) $\sum_{i=1}^{m} a_i = 1$. Let $c = \min_{1 \le i \le m} a_i$. For $\eta > 0$, there exist $k \ge \frac{m}{c \cdot \eta}$ and $m$ real numbers $b_1, b_2, \ldots, b_m$ such that (1) for all $1 \le i \le m$, we have $b_i$ is a multiple of $\frac{1}{k}$ and $b_i > 0$; (2) $\sum_{i=1}^{m} b_i = 1$; and (3) for all $1 \le i \le m$, we have $\frac{a_i}{b_i} \le 1 + \eta$ and $\frac{b_i}{a_i} \le 1 + \eta$.

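Proof. Let $\ell = \frac{m}{c \cdot \eta}$. For $1 \le i \le m$, define $\overline{b}_i$ such that $\overline{b}_i$ is a multiple of $\frac{1}{\ell}$ and $a_i \le \overline{b}_i \le a_i + \frac{1}{\ell}$ (basically define $\overline{b}_i$ as the least multiple of $\frac{1}{\ell}$ that is at least the value of $a_i$). For $1 \le i \le m$, let $b_i = \frac{\overline{b}_i}{\sum_{j=1}^{m} \overline{b}_j}$; i.e., $b_i$ is defined from $\overline{b}_i$ with normalization. Clearly, $\sum_{i=1}^{m} b_i = 1$, and for all $1 \le i \le m$, we have $b_i > 0$ and $b_i$ can be expressed as a multiple of $\frac{1}{k}$ for some $k \ge \ell$. We have the following inequalities: for all $1 \le i \le m$, we have

$$b_i \le a_i + \frac{1}{\ell}; \qquad b_i \ge \frac{a_i}{1 + \frac{m}{\ell}}.$$

The first inequality follows since $\overline{b}_i \le a_i + \frac{1}{\ell}$ and $\sum_{i=1}^{m} \overline{b}_i \ge \sum_{i=1}^{m} a_i = 1$. The second inequality follows since $\overline{b}_i \ge a_i$ and $\sum_{i=1}^{m} \overline{b}_i \le \sum_{i=1}^{m} (a_i + \frac{1}{\ell}) = \sum_{i=1}^{m} a_i + \frac{m}{\ell} = 1 + \frac{m}{\ell}$. Hence for all $1 \le i \le m$, we have

$$\frac{b_i}{a_i} \le 1 + \frac{1}{\ell \cdot a_i} \le 1 + \frac{1}{\ell \cdot c} \le 1 + \eta; \qquad \frac{a_i}{b_i} \le 1 + \frac{m}{\ell} \le 1 + \eta \cdot c \le 1 + \eta.$$

The desired result follows. ∎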
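The rounding construction in this proof can be tried out directly. The Python sketch below rounds each number up to a multiple of 1/l and then normalizes; taking l as the integer ceiling of m/(c·eta) is an assumption made here so that the grid is integral (any l at least m/(c·eta) makes the two ratio bounds go through), and the function name `round_to_uniform` is hypothetical.

```python
from fractions import Fraction
from math import ceil


def round_to_uniform(a, eta):
    """Round a probability vector a (of Fractions) to a nearby vector on a rational grid."""
    m = len(a)
    c = min(a)
    ell = ceil(Fraction(m) / (c * eta))          # assumed integral grid size
    # Least multiple of 1/ell that is at least a_i.
    b_bar = [Fraction(ceil(x * ell), ell) for x in a]
    total = sum(b_bar)
    # Normalize so that the result sums to one.
    return [x / total for x in b_bar]


a = [Fraction(1, 7), Fraction(2, 5), Fraction(16, 35)]
eta = Fraction(1, 9)
b = round_to_uniform(a, eta)
```

With exact rationals, the guarantees of the lemma (sum equal to one, positivity, and both ratios bounded by 1 + eta) can be asserted without any numerical tolerance.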



Lemma 15 For all concurrent game structures $G$, for all safety objectives $\mathrm{Safe}(F)$, for $F \subseteq S$, for all $\epsilon > 0$, there exist $k > 0$ and k-uniform selectors $\xi$ such that $\overline{\xi}$ is an $\epsilon$-optimal strategy.

Proof. Our proof uses a result of Solan [27] and the existence of memoryless optimal strategies for concurrent safety games (Theorem 5). We first present the result of Solan specialized for MDPs with reachability objectives.

The result of [27]. Let $G = \langle S, M, \Gamma_2, \delta \rangle$ and $G' = \langle S, M, \Gamma_2, \delta' \rangle$ be two player-2 MDPs defined on the same state space $S$, with the same move set $M$ and the same move assignment function $\Gamma_2$, but with two different transition functions $\delta$ and $\delta'$, respectively. Let

$$\rho(G, G') = \max_{s, t \in S,\ a_2 \in \Gamma_2(s)} \left\{ \frac{\delta(s, a_2)(t)}{\delta'(s, a_2)(t)}, \frac{\delta'(s, a_2)(t)}{\delta(s, a_2)(t)} \right\} - 1$$

where by convention $x/0 = +\infty$ for $x > 0$, and $0/0 = 1$ (compare with equation (9) of [27]: $\rho(G, G')$ is obtained as a specialization of (9) of [27] for MDPs). Let $T \subseteq S$. For $s \in S$, let $v(s)$ and $v'(s)$ denote the value for player 2 for the reachability objective $\mathrm{Reach}(T)$ from $s$ in $G$ and $G'$, respectively. Then from Theorem 6 of [27] (also see equation (10) of [27]) it follows that

$$-4 \cdot |S| \cdot \rho(G, G') \le v(s) - v'(s) \le \frac{4 \cdot |S| \cdot \rho(G, G')}{(1 - 2 \cdot |S| \cdot \rho(G, G'))^{+}} \qquad (2)$$

where $x^{+} = \max\{x, 0\}$. We first explain how specialization of Theorem 6 of [27] yields (2). Theorem 6 of [27] was proved for value functions of discounted games with costs, even when the discount factor $\lambda = 0$. Since the value functions of limit-average games are obtained as the limit of the value functions of discounted games as the discount factor goes to 0 [23], the result of Theorem 6 of [27] also holds for concurrent limit-average games (this was the main result of [27]). Since reachability objectives are a special case of limit-average objectives, Theorem 6 of [27] also holds for reachability objectives. In the special case of reachability objectives with the same target set, the different cost functions used in equation (10) of [27] coincide, and the maximum absolute value of the cost is 1. Thus we obtain (2) as a specialization of Theorem 6 of [27].
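To make the distance between two MDPs and the resulting bound concrete, the following Python sketch computes the quantity described above, read here as the largest ratio between corresponding transition probabilities minus one, together with the lower and upper bounds of (2). This reading of the formula, the function names `rho` and `solan_bounds`, and the encoding `delta[s][a2][t]` are assumptions of this illustration.

```python
import math


def ratio(x, y):
    """Ratio with the conventions x/0 = +inf for x > 0 and 0/0 = 1."""
    if y == 0:
        return 1.0 if x == 0 else math.inf
    return x / y


def rho(delta, delta_prime):
    """Largest ratio between corresponding transition probabilities, minus one."""
    worst = 1.0
    for s in delta:
        for a2 in delta[s]:
            targets = set(delta[s][a2]) | set(delta_prime[s][a2])
            for t in targets:
                p = delta[s][a2].get(t, 0.0)
                q = delta_prime[s][a2].get(t, 0.0)
                worst = max(worst, ratio(p, q), ratio(q, p))
    return worst - 1.0


def solan_bounds(num_states, r):
    """Lower and upper bounds on v(s) - v'(s) as in (2), for distance r."""
    lower = -4 * num_states * r
    denom = max(1 - 2 * num_states * r, 0)      # the positive part x^+
    upper = math.inf if denom == 0 else 4 * num_states * r / denom
    return lower, upper


delta = {"s": {"c": {"s": 0.5, "t": 0.5}}}
delta_prime = {"s": {"c": {"s": 0.6, "t": 0.4}}}
r = rho(delta, delta_prime)
```

The upper bound becomes vacuous (infinite) as soon as 2·|S|·ρ reaches 1, which is why the perturbation parameter is later chosen small relative to the number of states.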

We now use the existence of memoryless optimal strategies in concurrent safety games, and (2) to obtain our desired result. Consider a concurrent safety game $G = \langle S, M, \Gamma_1, \Gamma_2, \delta \rangle$ with safe set $F$ for player 1. Let $\pi_1$ be a memoryless optimal strategy for the objective $\mathrm{Safe}(F)$. Let $c = \min_{s \in S,\ a_1 \in \Gamma_1(s)} \{\pi_1(s)(a_1) \mid \pi_1(s)(a_1) > 0\}$ be the minimum positive transition probability given by $\pi_1$. Given $\epsilon > 0$, let $\eta = \min\{\frac{1}{4|S|}, \frac{\epsilon}{8|S|}\}$. We define a memoryless strategy $\pi_1'$ satisfying the following conditions: for $s \in S$ and $a_1 \in \Gamma_1(s)$ we have

1. if $\pi_1(s)(a_1) = 0$, then $\pi_1'(s)(a_1) = 0$;

2. if $\pi_1(s)(a_1) > 0$, then the following conditions are satisfied:

(a) $\pi_1'(s)(a_1) > 0$;

(b) $\frac{\pi_1(s)(a_1)}{\pi_1'(s)(a_1)} \le 1 + \eta$;

(c) $\frac{\pi_1'(s)(a_1)}{\pi_1(s)(a_1)} \le 1 + \eta$;

(d) $\pi_1'(s)(a_1)$ is a multiple of $\frac{1}{k}$, for an integer $k > 0$ (such a $k$ exists by Lemma 14).

For sufficiently large $k$ such a strategy $\pi_1'$ exists (this follows from the construction of Lemma 14). Let $G_1$ and $G_1'$ be the two player-2 MDPs obtained from $G$ by fixing the memoryless strategies $\pi_1$ and $\pi_1'$, respectively. Then by definition of $\pi_1'$ we have $\rho(G_1, G_1') \le \eta$. Let $T = S \setminus F$. For $s \in S$, let the value of player 2 for the objective $\mathrm{Reach}(T)$ in $G_1$ and $G_1'$ be $v(s)$ and $v'(s)$, respectively. By (2) we have

$$-4 \cdot |S| \cdot \eta \le v(s) - v'(s) \le \frac{4 \cdot |S| \cdot \eta}{(1 - 2 \cdot |S| \cdot \eta)^{+}}.$$

Observe that by choice of $\eta$ we have (a) $4 \cdot |S| \cdot \eta \le \frac{\epsilon}{2}$ and (b) $2 \cdot |S| \cdot \eta \le \frac{1}{2}$. Hence we have $-\epsilon \le v(s) - v'(s) \le \epsilon$. Since $\pi_1$ is a memoryless optimal strategy, it follows that $\pi_1'$ is a k-uniform memoryless $\epsilon$-optimal strategy. ∎
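The arithmetic behind the final step can be checked with exact rationals. The Python sketch below assumes that the perturbation parameter is chosen as the minimum of 1/(4|S|) and epsilon/(8|S|), which is the choice that makes the two observations 4·|S|·eta <= epsilon/2 and 2·|S|·eta <= 1/2 hold; the function names are hypothetical.

```python
from fractions import Fraction


def choose_eta(eps, num_states):
    """Assumed choice of eta: min{1/(4|S|), eps/(8|S|)}."""
    return min(Fraction(1, 4 * num_states), Fraction(eps) / (8 * num_states))


def bounds_for_eta(eta, num_states):
    """Lower and upper bounds on v(s) - v'(s) from the perturbation bound."""
    lower = -4 * num_states * eta
    upper = 4 * num_states * eta / (1 - 2 * num_states * eta)
    return lower, upper


results = []
for eps in (Fraction(1, 100), Fraction(1, 2), Fraction(3)):
    for n in (1, 5, 40):
        eta = choose_eta(eps, n)
        lower, upper = bounds_for_eta(eta, n)
        results.append((eps, lower, upper))
```

The denominator of the upper bound is at least 1/2 under this choice, so no positive-part operator is needed in the sketch.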

Turn-based stochastic games convergence. We first observe that since pure memoryless optimal strategies exist for turn-based stochastic games with safety objectives (the result follows from results of [5, 22]), for turn-based stochastic games it suffices to iterate over pure memoryless selectors. Since the number of pure memoryless strategies is finite, it follows that for turn-based stochastic games Algorithm 2 always terminates and yields an optimal strategy. In other words, we can restrict the selectors used in Algorithm 2 in Steps 3.2.2 and 3.3.3.2 to be pure memoryless selectors. Then the local improvement steps of Algorithm 2 with pure memoryless selectors terminate, and by Theorem 7 yield a globally optimal pure memoryless strategy (which is an optimal strategy). We will apply the argument for turn-based stochastic games to a variant of Algorithm 2 restricted to k-uniform selectors.

Strategy improvement with k-uniform selectors. We now present the variant of Algorithm 2 where we restrict the algorithm to k-uniform selectors. The notations are essentially the same as used in Algorithm 2, but restricted to k-uniform selectors and presented as Algorithm 3 (for example, $\overline{G}^{k}_{v_i}$ is similar to $\overline{G}_{v_i}$ but restricted to k-uniform selectors, and similarly $\mathrm{OptSel}(v_i, s, k)$ are the optimal k-uniform selectors; see Section 8 for complete details). We first argue that if we restrict Algorithm 2 such that every iteration yields a k-uniform selector, for $k > 0$, then the algorithm terminates, i.e., Algorithm 3 terminates. This follows from the facts that (i) the sequence of strategies obtained is monotonic (Theorem 6) (i.e., the algorithm does not cycle among k-uniform selectors); and (ii) the number of k-uniform selectors for a given $k$ is finite. Given $k > 0$, let us denote by $z_i^k$ the valuation of Algorithm 3 at iteration $i$.
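The finiteness claim in (ii) can be made tangible by enumerating, for one state with m moves, all distributions whose positive entries are of the form i/j with 0 < i <= j <= k. The Python sketch below does this by brute force; the function name `k_uniform_distributions` is hypothetical, and the reading of the k-uniformity condition is the one just stated.

```python
from fractions import Fraction
from itertools import product


def k_uniform_distributions(m, k):
    """All probability vectors over m moves whose positive entries are i/j, 0 < i <= j <= k."""
    values = {Fraction(0)} | {
        Fraction(i, j) for j in range(1, k + 1) for i in range(1, j + 1)
    }
    return sorted(
        d for d in product(sorted(values), repeat=m) if sum(d) == 1
    )


two_moves_k2 = k_uniform_distributions(2, 2)
two_moves_k3 = k_uniform_distributions(2, 3)
```

For two moves there are 3 such distributions when k = 2 and 5 when k = 3; a k-uniform selector picks one such distribution per state, so their number is finite for every fixed k, although it grows quickly with m and k.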

Lemma 16 For all $k > 0$, there exists $i \ge 0$ such that $z_i^k = z_{i+1}^k$.

Convergence to optimal k-uniform memoryless strategies. We now argue that the valuation that Algorithm 3 converges to is optimal for k-uniform selectors. The argument is as follows: if we restrict player 1 to choose between k-uniform selectors, then a concurrent game structure $G$ can be converted to a turn-based stochastic game structure, where player 1 first chooses a k-uniform selector, then player 2 chooses an action, and then the transition is determined by the chosen k-uniform selector of player 1, the action of player 2 and the transition function $\delta$ of the game structure $G$. Then by termination for turn-based stochastic games it follows that the algorithm will terminate. It follows from Theorem 7 that upon termination we obtain an optimal strategy for the turn-based stochastic game. In other words, as discussed above for turn-based stochastic games, the local iteration converges to a globally optimal strategy. Hence the valuation obtained upon termination is the maximal value obtained over all k-uniform memoryless strategies. This gives us the following lemma (also see the appendix for a detailed proof).

Lemma 17 For all $k > 0$, let $i \ge 0$ be such that $z_i^k = z_{i+1}^k$. Then we have $z_i^k = \max_{\pi_1 \in \Pi_1^{M,k}} \min_{\pi_2 \in \Pi_2} \Pr^{\pi_1, \pi_2}(\mathrm{Safe}(F))$.

Lemma 18 For all concurrent game structures $G$, for all safety objectives $\mathrm{Safe}(F)$, for $F \subseteq S$, for all $\epsilon > 0$, there exist $k > 0$ and $i \ge 0$ such that for all $s \in S$ we have $z_i^k(s) \ge \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) - \epsilon$.

Proof. By Lemma 15, for all $\epsilon > 0$, there exists $k > 0$ such that there is a k-uniform memoryless $\epsilon$-optimal strategy for player 1. By Lemma 16, for all $k > 0$, there exists an $i \ge 0$ such that $z_i^k = z_{i+1}^k$, and by Lemma 17 it follows that the valuation $z_i^k$ represents the maximal value obtained by k-uniform memoryless strategies. Hence it follows that there exist $k > 0$ and $i \ge 0$ such that for all $s \in S$ we have $z_i^k(s) \ge \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) - \epsilon$. The desired result follows. ∎

We now present the convergent strategy improvement algorithm for safety objectives as Algorithm 4 that iterates over k-uniform strategy values. The algorithm iteratively calls Algorithm 3 with larger $k$, unless the termination condition of Algorithm 2 is satisfied.

Theorem 8 (Monotonicity, optimality on termination and convergence). Let $v_i$ be the valuation obtained at iteration $i$ of Algorithm 4. Then the following assertions hold.

1. For all $i \ge 0$ we have $v_{i+1} \ge v_i$.

2. If the algorithm terminates, then $v_i = \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))$.

3. For all $\epsilon > 0$, there exists $i$ such that for all $s$ we have $v_i(s) \ge \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) - \epsilon$.

4. $\lim_{i \to \infty} v_i = \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))$.

Proof. We prove the results as follows.

1. Let $v_i$ be the valuation of Algorithm 4 at iteration $i$. For $k > 0$, let $z_i^k$ denote the valuation of Algorithm 3 with the restriction to k-uniform selectors at iteration $i$, and let $z^k_{i^*(k)}$ denote the least fixpoint (i.e., $i^*(k)$ is the least value of $i$ such that $z_i^k = z_{i+1}^k$). Since k-uniform selectors are a subset of $(k+1)$-uniform selectors, it follows that the maximal value obtained over strategies that use $(k+1)$-uniform selectors is at least the maximal value obtained over k-uniform selectors. Since $z^k_{i^*(k)}$ denotes the maximal value obtained over k-uniform selectors (this follows from Lemma 17), we have that $z^k_{i^*(k)} \le z^{k+1}_{i^*(k+1)}$ (note that we do not require that $i^*(k) \le i^*(k+1)$, i.e., the algorithm with $(k+1)$-uniform selectors may require more iterations to terminate). We have $v_k = z^k_{i^*(k)}$ and hence the first result follows.
Figure 3: A simple game with irrational value.


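2. The result follows from Theorem 7.

3. From Lemma 18 it follows that for all $\epsilon > 0$, there exists a $k > 0$ such that for all $s$ we have $z^k_{i^*(k)}(s) \ge \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) - \epsilon$. Hence $v_k(s) \ge \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) - \epsilon$. Hence we have that for all $\epsilon > 0$, there exists $k \ge 0$, such that for all $s \in S$ we have $v_k(s) \ge \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) - \epsilon$.

4. By part (1) for all $i \ge 0$ we have $v_{i+1} \ge v_i$. By part (3), for all $\epsilon > 0$, there exists $i \ge 0$ such that for all $s \in S$ we have $v_i(s) \ge \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) - \epsilon$. Hence it follows that for all $\epsilon > 0$, there exists $i \ge 0$ such that for all $j \ge i$ and for all $s \in S$ we have $v_j(s) \ge \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))(s) - \epsilon$. It follows that $\lim_{i \to \infty} v_i = \langle\langle 1 \rangle\rangle_{val}(\mathrm{Safe}(F))$.

This gives us the following result. ∎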

Discussion on convergence of Algorithm 2. We will now present an example to illustrate that (contrary to the claim of Theorem 4.3 of [2]) Algorithm 2 need not converge to the values in concurrent safety games. However, as discussed before, Algorithm 2 satisfies monotonicity and optimality on termination, and for turn-based stochastic games (and also when restricted to k-uniform strategies) it converges to the values as termination is guaranteed. In the example we will also argue how Algorithm 4 converges to the values of the game.

Example 3 Our example consists of two steps. In the first step we will present a gadget where the value is irrational and with probability 1 absorbing states are reached.

Step 1. We first consider the game shown in Fig 3 with three states $\{s_0, s_1, s_2\}$ with two actions $a, b$ for player 1 and $c, d$ for player 2. The states $s_0, s_1$ are safe states, and $s_2$ is a non-safe state. The transitions are as follows: (1) $s_1$ and $s_2$ are absorbing; and (2) in $s_0$ we have the following transitions, (a) given action pairs $ac$ and $bd$ the next state is $s_1$, (b) given action pair $bc$ the next state is $s_2$, and (c) given action pair $ad$ the next states are $s_0$ and $s_1$ with probability 1/2 each. In this game, the state $s_0$ is transient, as given any action pair, the set $\{s_1, s_2\}$ of absorbing states is reached with probability at least 1/2 in one step. Hence the set $\{s_1, s_2\}$ is reached with probability 1, irrespective of the choice of strategies of the players. Hence in this game the objective for player 1 is equivalently to reach $s_1$. Let us denote by $x$ the value of the game at $s_0$, and let us consider the following matrix



Then $x = \min\max M$. In other words, consider the valuation $v_x = (x, 1, 0)$ for states $s_0$, $s_1$ and $s_2$,
respectively; then $x = \min\max M$ expresses that $v_x = \mathit{Pre}_1(v_x)$, and it is the least fixpoint of valuations
satisfying $v = \mathit{Pre}_1(v)$. We now analyze the value $x$ at $s_0$. The value $x$ is obtained by solving the
following optimization problem

minimize $x$ subject to $y + ((1 - y) \cdot x)/2 \le x$ and $1 - y \le x$.

Figure 4: Counterexample game.

Intuitively, $y$ denotes the probability with which player 2 chooses move $c$ in an optimal strategy. The solution
to the optimization problem is achieved by setting $x = 1 - y$. Hence we have $y + (1 - y)^2/2 = (1 - y)$, i.e.,
$(1 + y)^2 = 2$. Since $y$ must lie in the interval $[0, 1]$, we have $y = \sqrt{2} - 1$, and thus we have
$x = 2 - \sqrt{2} < 0.6$. We now analyze Algorithm 2 on this example. Let $v_i$ denote the valuation of the $i$-th
iteration, and let $v_i^0$ be the value at state $s_0$. Then we have $v_i^0 < v_{i+1}^0$, and in the limit it converges to the
value $2 - \sqrt{2}$. We observe that on this example Algorithm 2 behaves exactly as Algorithm 1 (strategy
improvement for reachability), as the objective for player 1 is equivalently to reach $s_1$, since $s_0$ is transient.
The reason for the strict inequality $v_i^0 < v_{i+1}^0$ is as follows: if the valuation at state $s_0$ in the $i$-th and
$(i+1)$-th iteration were the same, then by correctness of Algorithm 1 the values would have been achieved in
finitely many steps, implying convergence to a rational value at $s_0$; this is impossible, as $2 - \sqrt{2}$ is
irrational. The convergence to the values in the limit is due to correctness of Algorithm 1.
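
As a numerical sanity check of the closed form derived in Step 1, the following sketch solves the equation $y + (1 - y)^2/2 = 1 - y$ by bisection on $[0, 1]$ and compares the result with $\sqrt{2} - 1$. The function names and the tolerance are illustrative choices and are not part of the paper.

```python
import math


def residual(y: float) -> float:
    """Left side minus right side of y + (1 - y)^2 / 2 = 1 - y."""
    return y + (1.0 - y) ** 2 / 2.0 - (1.0 - y)


def solve_y(tol: float = 1e-12) -> float:
    """Bisection on [0, 1]; the residual is increasing with
    residual(0) < 0 < residual(1), so there is exactly one root."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if residual(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


y = solve_y()
x = 1.0 - y  # at the optimum the text sets x = 1 - y
print(f"y = {y:.6f}, x = {x:.6f}")
```

The computed $x$ agrees with $2 - \sqrt{2}$ up to the tolerance and lies below 0.6, which is the property used in Step 2.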

Step 2. We now augment the game of Step 1 to construct an example showing that Algorithm 2 does not
necessarily converge to the values. Consider the game shown in Fig 4, which augments the game of Fig 3 with
some additional states (states $s_3$, $s_4$ and $s_5$) and transitions (we only show the interesting transitions in the
figure for simplicity). All the additional states shown are safe states. The value of state $s_5$ is 0.6 (consider it
as a probabilistic state going to state $s_1$ with probability 0.6 and $s_2$ with probability 0.4; these edges are
not shown in the figure). The transitions at states $s_3$ and $s_4$ are as follows: in state $s_3$, player 1 can go to state
$s_0$ or $s_4$ by choosing actions $a$ and $b$, respectively (at $s_3$ player 2 has only one action $\bot$); and in state $s_4$,
player 2 can go to state $s_3$ or $s_5$ by choosing actions $c$ and $d$, respectively (at $s_4$ player 1 has only one action
$\bot$). We analyze Algorithm 2 on the example shown in Fig 4. In this game, at $s_3$ player 1 starts by playing
actions $a$ and $b$ uniformly, and player 2 responds by choosing action $c$. In the iterations of the algorithm it
follows by the argument of Step 1 that the set $I$ of Step 3.1 of Algorithm 2 is always non-empty, as $s_0 \in I$.
Hence in every iteration the value at $s_0$ improves, and the strategy in $s_3$ and $s_4$ does not change. Hence the
valuation at $s_3$ converges to the valuation at $s_0$, i.e., to $2 - \sqrt{2} < 0.6$. However, by switching to action $b$ at
$s_3$, player 1 can force player 2 to play action $d$ at $s_4$ and ensure value 0.6. In other words, the value at $s_3$
is 0.6, whereas Algorithm 2 converges to $2 - \sqrt{2} < 0.6$.
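
The gap at $s_3$ described in Step 2 can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the two situations in the text: the helper names are hypothetical, the limit $2 - \sqrt{2}$ at $s_0$ is taken from Step 1, and the value 0.6 at $s_5$ is taken from the example.

```python
import math

V_S0_LIMIT = 2 - math.sqrt(2)  # limit of the valuation at s0 (Step 1)
V_S5 = 0.6                     # value of s5, as stated in the example


def value_s3_fixed_strategies(v_s0: float, rounds: int = 200) -> float:
    """Valuation at s3 when player 1 mixes a and b uniformly at s3 and
    player 2 answers c at s4 (back to s3): v3 = v_s0 / 2 + v4 / 2, v4 = v3."""
    v3 = 0.0
    for _ in range(rounds):
        v4 = v3                      # action c returns to s3
        v3 = 0.5 * v_s0 + 0.5 * v4   # a goes to s0, b goes to s4
    return v3


def value_s3_after_switch(v_s5: float) -> float:
    """Value at s3 when player 1 plays b with probability 1.
    At s4, action c loops between the safe states s3 and s4 forever
    (safety holds, payoff 1), while action d moves to s5."""
    return min(1.0, v_s5)


stuck = value_s3_fixed_strategies(V_S0_LIMIT)
switched = value_s3_after_switch(V_S5)
print(f"unchanged strategies: {stuck:.4f}, after switching to b: {switched:.4f}")
```

With the strategies at $s_3$ and $s_4$ left unchanged, the valuation at $s_3$ simply tracks the valuation at $s_0$, whereas the switch to $b$ yields 0.6.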

The switch to action $b$ would have been ensured by the turn-based construction of Step 3.3. For
turn-based stochastic games or $k$-uniform memoryless strategies, since convergence to values is guaranteed,
the turn-based construction of Step 3.3 is also guaranteed to be executed. However, as the convergence to
values in concurrent games is only in the limit, Step 3.3 of Algorithm 2 may never be executed, as shown by
this example. We now illustrate that the valuations of Algorithm 4 nevertheless converge to the values.
Consider $k$-uniform strategies, for a finite $k > 2$; then the value at $s_0$ for $k$-uniform
strategies converges in finitely many steps to a value smaller than 0.6 (as it converges to a value smaller
than the value at $s_0$), and then Step 3.3 of Algorithm 3 is executed, and the value at $s_3$ is
set to 0.6. In other words, for Algorithm 4 the values at $s_3$, $s_4$ and $s_5$ are always set to 0.6, and the
value at $s_0$ converges in the limit to $2 - \sqrt{2}$. Thus the example shows that though Algorithm 2 does
not necessarily converge to the values, Algorithm 4 correctly converges to the values. ■

Retraction of Theorem 4.3 of [2]. In [2], the convergence of Algorithm 2 was claimed. Unfortunately the 
theorem is incorrect (with irreparable error) as shown by Example 3 and we retract the claim of Theorem 4.3 
of [2] of convergence of Algorithm 2 for concurrent games. 

Complexity. Algorithm 2 may not terminate in general; we briefly describe the complexity of every iteration.
Given a valuation $v_i$, the computation of $\mathit{Pre}_1(v_i)$ involves the solution of matrix games with rewards
$v_i$; this can be done in polynomial time using linear programming. Given $v_i$, if $\mathit{Pre}_1(v_i) = v_i$, the sets
$\mathrm{OptSel}(v_i, s)$ and $\mathrm{OptSelCount}(v_i, s)$ can be computed by enumerating the subsets of available actions at $s$
and then using linear programming. For example, to check whether $(A, B) \in \mathrm{OptSelCount}(v_i, s)$ it suffices
to check both of these facts:

1. ($A$ is the support of an optimal selector $\xi_1$.) There is a selector $\xi_1$ such that (i) $\xi_1$ is optimal (i.e., for
all actions $b \in \Gamma_2(s)$ we have $\mathit{Pre}_{\xi_1, b}(v_i)(s) \ge v_i(s)$); (ii) for all $a \in A$ we have $\xi_1(a) > 0$, and for
all $a \notin A$ we have $\xi_1(a) = 0$;

2. ($B$ is the set of counter-optimal actions against $\xi_1$.) For all $b \in B$ we have $\mathit{Pre}_{\xi_1, b}(v_i)(s) = v_i(s)$,
and for all $b \notin B$ we have $\mathit{Pre}_{\xi_1, b}(v_i)(s) > v_i(s)$.

All the above checks can be performed by checking feasibility of sets of linear equalities and inequalities.
Hence, $\mathrm{TB}(G, v_i, F)$ can be computed in time polynomial in the size of $G$ and $v_i$ and exponential in the number
of moves. We observe that the construction is exponential only in the number of moves at a state, and
not in the number of states. The number of moves at a state is typically much smaller than the size of the
state space. We also observe that the improvement step 3.3.2 requires the computation of the set of almost-sure
winning states of a turn-based stochastic safety game: this can be done both via linear-time discrete
graph-theoretic algorithms [4], and via symbolic algorithms [10]. Both of these methods are more efficient
than the basic step 3.4 of the improvement algorithm, where the quantitative values of an MDP must be
computed. Thus, the improvement step 3.3 of Algorithm 2 should not, in practice, be inefficient compared
with the standard improvement steps 3.2 and 3.4. We now discuss the above steps for Algorithm 3. The
argument is similar to the one above, but in the case of $k$-uniform selectors we need to ensure that the
witness selectors are $k$-uniform, which can be achieved with integer constraints. In other words, for Algorithm 3
the above checks are performed by checking feasibility of sets of integer linear equalities and inequalities
(which can be done in exponential time). Again, the construction is exponential in the number of moves at a
state, and not in the number of states: we enumerate over sets of moves at a state (exponentially many in the
number of moves), and then solve integer linear constraints (the size of the integer linear constraints is
polynomial in the number of moves, and they can be solved in time exponential in the number of moves). Thus,
again, the improvement step 3.3 of Algorithm 3 is polynomial in the size of the game, and exponential in the
number of moves.
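
The two membership checks for $\mathrm{OptSelCount}$ described above can be illustrated on a single state. The sketch below does not perform the linear-programming feasibility search; it only verifies that a given candidate selector witnesses a pair $(A, B)$. The function names, the reward-matrix encoding `R[a][b]` (the expected value of the valuation after the action pair) and the tolerance are assumptions made for illustration.

```python
from typing import Dict, List, Set


def payoff(R: List[List[float]], xi: Dict[int, float], b: int) -> float:
    """One-step value Pre_{xi,b}(v)(s) when player 1 mixes according to xi
    and player 2 plays b; R[a][b] is the expected valuation after (a, b)."""
    return sum(p * R[a][b] for a, p in xi.items())


def witnesses_pair(R: List[List[float]], xi: Dict[int, float],
                   A: Set[int], B: Set[int], v_s: float,
                   tol: float = 1e-9) -> bool:
    """Return True if xi is optimal for v_s, has support exactly A,
    and B is exactly the set of counter-optimal actions against xi."""
    moves2 = range(len(R[0]))
    support = {a for a, p in xi.items() if p > tol}
    if support != A:                               # check 1(ii)
        return False
    vals = [payoff(R, xi, b) for b in moves2]
    if any(val < v_s - tol for val in vals):       # check 1(i)
        return False
    tight = {b for b in moves2 if abs(vals[b] - v_s) <= tol}
    return tight == B                              # check 2


# Hypothetical 2x2 matrix game with value 1/2 at the state.
R = [[1.0, 0.0], [0.0, 1.0]]
uniform = {0: 0.5, 1: 0.5}
print(witnesses_pair(R, uniform, {0, 1}, {0, 1}, 0.5))
print(witnesses_pair(R, uniform, {0, 1}, {0}, 0.5))
```

In the procedure of the text the selector is not given but is an unknown of a linear feasibility problem, one per enumerated pair $(A, B)$; the verification above corresponds to evaluating those constraints at a fixed point.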

7.2 Termination for Approximation 

In this subsection we present termination criteria for strategy improvement algorithms for concurrent games
for $\varepsilon$-approximation.

Termination for concurrent games. Applying the reachability strategy improvement algorithm (Algorithm 1)
for player 2, for a reachability objective $\mathrm{Reach}(T)$, we obtain a sequence of valuations $(u_i)_{i \ge 0}$ such
that (a) $u_{i+1} \ge u_i$; (b) if $u_{i+1} = u_i$, then $u_i = \langle\!\langle 2 \rangle\!\rangle_{\mathit{val}}(\mathrm{Reach}(T))$; and (c) $\lim_{i \to \infty} u_i = \langle\!\langle 2 \rangle\!\rangle_{\mathit{val}}(\mathrm{Reach}(T))$.
Given a concurrent game $G$ with $F \subseteq S$ and $T = S \setminus F$, we apply Algorithm 1 to obtain the sequence
of valuations $(u_i)_{i \ge 0}$ as above, and we apply Algorithm 4 to obtain a sequence of valuations $(v_i)_{i \ge 0}$. The
termination criteria are as follows:

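1. if for some $i$ we have $u_{i+1} = u_i$, then we have $u_i = \langle\!\langle 2 \rangle\!\rangle_{\mathit{val}}(\mathrm{Reach}(T))$ and $1 - u_i =
\langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\mathrm{Safe}(F))$, and we obtain the values of the game;

2. if for some $i$ we have $v_{i+1} = v_i$, then we have $1 - v_i = \langle\!\langle 2 \rangle\!\rangle_{\mathit{val}}(\mathrm{Reach}(T))$ and $v_i = \langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\mathrm{Safe}(F))$,
and we obtain the values of the game; and

3. for $\varepsilon > 0$, if for some $i \ge 0$ we have $u_i + v_i \ge 1 - \varepsilon$, then for all $s \in S$ we have $v_i(s) \ge
\langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\mathrm{Safe}(F))(s) - \varepsilon$ and $u_i(s) \ge \langle\!\langle 2 \rangle\!\rangle_{\mathit{val}}(\mathrm{Reach}(T))(s) - \varepsilon$ (i.e., the algorithm can stop for
$\varepsilon$-approximation).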

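Observe that since $(u_i)_{i \ge 0}$ and $(v_i)_{i \ge 0}$ are both monotonically non-decreasing and $\langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\mathrm{Safe}(F)) +
\langle\!\langle 2 \rangle\!\rangle_{\mathit{val}}(\mathrm{Reach}(T)) = 1$, we have $u_j + v_j \le 1$ for all $j$. It follows that if $u_i + v_i \ge 1 - \varepsilon$, then for all
$j \ge i$ we have $u_i \ge u_j - \varepsilon$ and $v_i \ge v_j - \varepsilon$ (for instance, $u_j \le 1 - v_j \le 1 - v_i \le u_i + \varepsilon$). This establishes
that $v_i \ge \langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\mathrm{Safe}(F)) - \varepsilon$ and $u_i \ge \langle\!\langle 2 \rangle\!\rangle_{\mathit{val}}(\mathrm{Reach}(T)) - \varepsilon$, and the
correctness of the stopping criterion (3) for $\varepsilon$-approximation follows. We also note that instead of applying
the reachability strategy improvement algorithm, a value-iteration algorithm can be applied for reachability
games to obtain a sequence of valuations with properties similar to $(u_i)_{i \ge 0}$, and the above termination criteria
can be applied.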
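
Stopping criterion (3) amounts to a pointwise comparison of the two valuations. The following sketch assumes the valuations are stored as dictionaries over the same set of states; the function name and the sample numbers are hypothetical and serve only to illustrate the test.

```python
from typing import Dict, Hashable


def can_stop(u: Dict[Hashable, float], v: Dict[Hashable, float],
             eps: float) -> bool:
    """Criterion (3): stop once u_i(s) + v_i(s) >= 1 - eps at every state s,
    where u_i under-approximates the reachability value for player 2 and
    v_i under-approximates the safety value for player 1."""
    return all(u[s] + v[s] >= 1.0 - eps for s in u)


# Hypothetical valuations after some iteration i.
u_i = {"s0": 0.40, "s3": 0.39}
v_i = {"s0": 0.58, "s3": 0.60}
print(can_stop(u_i, v_i, eps=0.05))   # sums are 0.98 and 0.99
print(can_stop(u_i, v_i, eps=0.01))
```

When the test succeeds, both $u_i$ and $v_i$ are within $\varepsilon$ of the respective values at every state, by the argument given above.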

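Theorem 9 Let $G$ be a concurrent game structure with a safety objective $\mathrm{Safe}(F)$. Algorithm 4 and
Algorithm 1 for player 2 for the reachability objective $\mathrm{Reach}(S \setminus F)$ yield two sequences of monotonic
valuations $(v_i)_{i \ge 0}$ and $(u_i)_{i \ge 0}$, respectively, such that (a) for all $i \ge 0$, we have $v_i \le \langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\mathrm{Safe}(F)) \le
1 - u_i$; and (b) $\lim_{i \to \infty} v_i = \lim_{i \to \infty} (1 - u_i) = \langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\mathrm{Safe}(F))$.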

Bounds for approximation. We now discuss the bounds for approximation for concurrent games with
reachability objectives, which follow from the results of [18, 17]. It follows from the results of [18] that
for all $\varepsilon > 0$, there exist $k$-uniform memoryless $\varepsilon$-optimal strategies for concurrent reachability and safety
games $G$, where $k$ is bounded by $(1/\varepsilon)^{2^{O(|G|)}}$. It follows that for all $\varepsilon > 0$, if we consider our strategy
improvement algorithm (restricted to $k$-uniform selectors) for reachability games, then upon termination
the valuation obtained is an $\varepsilon$-approximation of the value function of the game, where $k$ is bounded by
$(1/\varepsilon)^{2^{O(|G|)}}$. Using the restriction to $k$-uniform memoryless strategies, along with the reduction of concurrent
games to turn-based stochastic games for $k$-uniform memoryless strategies and the termination bound for
turn-based stochastic games, we obtain a double exponential bound on the number of iterations required for
termination (note that if $k = (1/\varepsilon)^{2^{O(|G|)}}$, then the total number of $k$-uniform memoryless strategies is $k^{O(|G|)}$,
which is double exponential; see also [17] for details). Moreover, the recent result of [17] shows that the
double exponential bound is near optimal for the strategy improvement algorithm for concurrent games with
reachability objectives.

Approximation of strategies. The previous method to solve concurrent reachability and safety games was
the value-iteration algorithm. The witness strategy produced by the value-iteration algorithm for concurrent
reachability games is not memoryless; and for concurrent safety games, since the value-iteration algorithm
converges from above, it does not provide any witness strategies. The only previous algorithm to approximate
memoryless $\varepsilon$-optimal strategies, for $\varepsilon > 0$, for concurrent reachability and safety games is the naive
algorithm that exhaustively searches over the set of all $k$-uniform memoryless strategies (where $k$ is chosen
such that $k$-uniform memoryless strategies suffice for $\varepsilon$-optimality, and $k$ depends on $\varepsilon$). Our strategy improvement algorithms
for concurrent reachability and safety games are the first strategy-search-based approach to approximate
$\varepsilon$-optimal strategies.

Algorithm 3 $k$-Uniform Restricted Safety Strategy-Improvement Algorithm

Input: a concurrent game structure $G$ with safe set $F$, and number $k$.
Output: a strategy $\gamma$ for player 1.

0. Compute $W_1 = \{s \in S \mid \langle\!\langle 1 \rangle\!\rangle_{\mathit{val}}(\mathrm{Safe}(F))(s) = 1\}$; and $k = \max\{k, |M|\}$.

1. Let $\gamma_0 = \xi_1^{\mathrm{unif}}$ and $i = 0$.

2. Compute $v_0 = \langle\!\langle 1 \rangle\!\rangle^{\gamma_0}_{\mathit{val}}(\mathrm{Safe}(F))$.

3. do {

    3.1 Let $I_k = \{s \in S \setminus (W_1 \cup T) \mid \sup_{\xi_1 \in \Lambda^k_1(s)} \mathit{Pre}_{1:\xi_1}(v_i)(s) > v_i(s)\}$.

    3.2 if $I_k \ne \emptyset$, then

        3.2.1 Let $\xi_1$ be a $k$-uniform player-1 selector such that for all states $s \in I_k$,
              we have $\mathit{Pre}_{1:\xi_1}(v_i)(s) = \sup_{\xi'_1 \in \Lambda^k_1(s)} \mathit{Pre}_{1:\xi'_1}(v_i)(s) > v_i(s)$.

        3.2.2 The player-1 selector $\gamma_{i+1}$ is defined as follows: for each state $s \in S$, let

    3.3 else

        3.3.1 let $(\overline{G}^k_{v_i}, \overline{F}) = \mathrm{TB}(G, v_i, F, k)$

        3.3.2 let $A^k_i$ be the set of almost-sure winning states in $\overline{G}^k_{v_i}$ for $\mathrm{Safe}(\overline{F})$ and
              $\overline{\pi}_1$ be a pure memoryless almost-sure winning strategy from the set $A^k_i$.

        3.3.3 if $((A^k_i \cap S) \setminus W_1 \ne \emptyset)$

            3.3.3.1 let $U = (A^k_i \cap S) \setminus W_1$

    3.5 Let $i = i + 1$.

} until $I_k = \emptyset$ and $(A^k_{i-1} \cap S) \setminus W_1 = \emptyset$.

4. return $\gamma_i$.






Algorithm 4 Convergent Safety Strategy-Improvement Algorithm

Input: a concurrent game structure $G$ with safe set $F$.
Output: a strategy $\gamma$ for player 1.

0. $k = |M|$ and $i = 0$.

1. do {

    1.1 $\gamma_{i+1}$ = Algorithm 3$(G, F, k)$

    1.2 Compute $v_{i+1} = \langle\!\langle 1 \rangle\!\rangle^{\gamma_{i+1}}_{\mathit{val}}(\mathrm{Safe}(F))$

    1.3 Let $I = \{s \in S \setminus (W_1 \cup T) \mid \mathit{Pre}_1(v_i)(s) > v_i(s)\}$.

    1.4 let $(\overline{G}_{v_i}, \overline{F}) = \mathrm{TB}(G, v_i, F)$

        1.4.1 let $A_i$ be the set of almost-sure winning states in $\overline{G}_{v_i}$ for $\mathrm{Safe}(\overline{F})$.

    1.5 Let $i = i + 1$ and $k = k + 1$.

} until $I = \emptyset$ and $(A_{i-1} \cap S) \setminus W_1 = \emptyset$.

2. return $\gamma_i$.

8 Technical Appendix

We now present the details of the restriction to $k$-uniform selectors, and the details of the notation used in
Algorithm 3. The definitions are essentially the same as for selectors and optimal selectors, but restricted to
$k$-uniform selectors.

Optimal $k$-uniform selectors. For $k > 0$, a valuation $v$ and a state $s$, let

$\mathit{Pre}^k_1(v)(s) = \sup_{\xi_1 \in \Lambda^k_1(s)} \mathit{Pre}_{1:\xi_1}(v)(s)$

denote the optimal one-step value among $k$-uniform selectors. For $k > 0$, given a valuation $v$ and a state $s$,
we define by

$\mathrm{OptSel}(v, s, k) = \{\xi_1 \in \Lambda^k_1(s) \mid \mathit{Pre}_{1:\xi_1}(v)(s) = \mathit{Pre}^k_1(v)(s)\}$

the set of optimal selectors among $k$-uniform selectors for $v$ at state $s$. For a $k$-uniform optimal selector
$\xi_1 \in \mathrm{OptSel}(v, s, k)$, we define the set of counter-optimal actions as follows:

$\mathrm{CountOpt}(v, s, \xi_1, k) = \{b \in \Gamma_2(s) \mid \mathit{Pre}_{\xi_1, b}(v)(s) = \mathit{Pre}^k_1(v)(s)\}$.

Observe that for $\xi_1 \in \mathrm{OptSel}(v, s, k)$, for all $b \in \Gamma_2(s) \setminus \mathrm{CountOpt}(v, s, \xi_1, k)$ we have $\mathit{Pre}_{\xi_1, b}(v)(s) >
\mathit{Pre}^k_1(v)(s)$. We define the set of $k$-uniform optimal selector supports and counter-optimal action sets as
follows:

$\mathrm{OptSelCount}(v, s, k) = \{(A, B) \subseteq \Gamma_1(s) \times \Gamma_2(s) \mid \exists \xi_1 \in \Lambda^k_1(s).\ \xi_1 \in \mathrm{OptSel}(v, s, k)
\wedge \mathrm{Supp}(\xi_1) = A \wedge \mathrm{CountOpt}(v, s, \xi_1, k) = B\}$;

i.e., it consists of pairs $(A, B)$ of actions of player 1 and player 2, such that there is a $k$-uniform optimal
selector $\xi_1$ with support $A$, and $B$ is the set of counter-optimal actions to $\xi_1$.
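
For small move sets, the quantities $\mathit{Pre}^k_1(v)(s)$ and $\mathrm{OptSel}(v, s, k)$ can be computed by brute-force enumeration. The sketch below rests on two assumptions that are not stated in this appendix: a selector is treated as $k$-uniform when every probability is an integer multiple of $1/k$ (a simplified reading of the definition), and $\mathit{Pre}_{1:\xi_1}(v)(s)$ is taken to be the minimum over player-2 actions $b$ of $\mathit{Pre}_{\xi_1, b}(v)(s)$. All function names and the matrix `R` are hypothetical.

```python
from fractions import Fraction
from itertools import product
from typing import Iterator, List, Tuple

Selector = Tuple[Fraction, ...]


def k_uniform_selectors(n_moves: int, k: int) -> Iterator[Selector]:
    """All distributions over n_moves moves whose probabilities are integer
    multiples of 1/k (ASSUMED simplified reading of k-uniform)."""
    for counts in product(range(k + 1), repeat=n_moves):
        if sum(counts) == k:
            yield tuple(Fraction(c, k) for c in counts)


def guaranteed(R: List[List[int]], xi: Selector) -> Fraction:
    """min over b of the one-step value of xi against b;
    R[a][b] is the expected valuation after the action pair (a, b)."""
    n1, n2 = len(R), len(R[0])
    return min(sum(xi[a] * R[a][b] for a in range(n1)) for b in range(n2))


def pre_k(R: List[List[int]], k: int) -> Fraction:
    """Optimal one-step value among the enumerated k-uniform selectors."""
    return max(guaranteed(R, xi) for xi in k_uniform_selectors(len(R), k))


def opt_sel(R: List[List[int]], k: int) -> List[Selector]:
    """Enumerated k-uniform selectors that attain pre_k(R, k)."""
    best = pre_k(R, k)
    return [xi for xi in k_uniform_selectors(len(R), k)
            if guaranteed(R, xi) == best]


R = [[1, 0], [0, 1]]          # hypothetical one-state matrix game
print(pre_k(R, 2), pre_k(R, 3))
print(opt_sel(R, 3))
```

On this matrix the restricted value is 1/2 for $k = 2$ but only 1/3 for $k = 3$ under the simplified reading, which shows how strongly the restricted value can depend on the granularity $k$.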

Turn-based reduction. Given a concurrent game $G = (S, M, \Gamma_1, \Gamma_2, \delta)$, a valuation $v$, and a bound $k$ for
$k$-uniformity, we construct a turn-based stochastic game $\overline{G}^k_v = ((\overline{S}, \overline{E}), (\overline{S}_1, \overline{S}_2, \overline{S}_R), \overline{\delta})$ as follows:

1. The set of states is as follows:

$\overline{S} = S \cup \{(s, A, B) \mid s \in S, (A, B) \in \mathrm{OptSelCount}(v, s, k)\}
\cup \{(s, A, b) \mid s \in S, (A, B) \in \mathrm{OptSelCount}(v, s, k), b \in B\}$.

2. The state space partition is as follows: $\overline{S}_1 = S$; $\overline{S}_2 = \{(s, A, B) \mid s \in S, (A, B) \in
\mathrm{OptSelCount}(v, s, k)\}$; and $\overline{S}_R = \{(s, A, b) \mid s \in S, (A, B) \in \mathrm{OptSelCount}(v, s, k), b \in B\}$.
In other words, $(\overline{S}_1, \overline{S}_2, \overline{S}_R)$ is a partition of the state space, where $\overline{S}_1$ are player 1 states, $\overline{S}_2$ are
player 2 states, and $\overline{S}_R$ are random or probabilistic states.

3. The set of edges is as follows:

$\overline{E} = \{(s, (s, A, B)) \mid s \in S, (A, B) \in \mathrm{OptSelCount}(v, s, k)\}
\cup \{((s, A, B), (s, A, b)) \mid b \in B\} \cup \{((s, A, b), t) \mid t \in \bigcup_{a \in A} \mathrm{Dest}(s, a, b)\}$.

4. The transition function $\overline{\delta}$ for all states in $\overline{S}_R$ is uniform over its successors.

Intuitively, the reduction is as follows. Given the valuation $v$, state $s$ is a player 1 state where player 1 can
select a pair $(A, B)$ (and move to state $(s, A, B)$) with $A \subseteq \Gamma_1(s)$ and $B \subseteq \Gamma_2(s)$ such that there is a
$k$-uniform optimal selector $\xi_1$ with support exactly $A$ and the set of counter-optimal actions to $\xi_1$ is the set $B$.
From a player 2 state $(s, A, B)$, player 2 can choose any action $b$ from the set $B$, and move to state $(s, A, b)$.
A state $(s, A, b)$ is a probabilistic state where all the states in $\bigcup_{a \in A} \mathrm{Dest}(s, a, b)$ are chosen uniformly at
random. Given a set $F \subseteq S$ we denote by $\overline{F} = F \cup \{(s, A, B) \in \overline{S} \mid s \in F\} \cup \{(s, A, b) \in \overline{S} \mid s \in F\}$.
We refer to the above reduction as TB, i.e., $(\overline{G}^k_v, \overline{F}) = \mathrm{TB}(G, v, F, k)$.
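
The construction of the state partition and the edge set can be sketched directly from items 1 to 4. In the sketch below the sets $\mathrm{OptSelCount}(v, s, k)$ are assumed to be precomputed and passed in as `opt_sel_count`, and `dest` stands for $\mathrm{Dest}$; these names, the string encoding of actions, and the one-state example are hypothetical.

```python
from fractions import Fraction
from typing import Callable, Dict, FrozenSet, Hashable, List, Set, Tuple

State = Hashable
Pair = Tuple[FrozenSet[str], FrozenSet[str]]


def turn_based_reduction(states: Set[State],
                         opt_sel_count: Dict[State, List[Pair]],
                         dest: Callable[[State, str, str], Set[State]]):
    """Build the player-1, player-2 and probabilistic states and the edges
    of the turn-based game from precomputed OptSelCount sets."""
    s1 = set(states)                     # player-1 states: original states
    s2, sr, edges = set(), set(), set()
    for s in states:
        for A, B in opt_sel_count.get(s, []):
            s2.add((s, A, B))                        # player-2 state
            edges.add((s, (s, A, B)))
            for b in B:
                sr.add((s, A, b))                    # probabilistic state
                edges.add(((s, A, B), (s, A, b)))
                for a in A:
                    for t in dest(s, a, b):
                        edges.add(((s, A, b), t))
    return s1, s2, sr, edges


def uniform_delta(node, edges) -> Dict[State, Fraction]:
    """Uniform distribution over the successors of a probabilistic state."""
    succ = {t for (src, t) in edges if src == node}
    return {t: Fraction(1, len(succ)) for t in succ}


# Hypothetical example: one non-absorbing state "p" with a single pair (A, B).
A, B = frozenset({"a", "b"}), frozenset({"c", "d"})
table = {("p", "a", "c"): {"q"}, ("p", "a", "d"): {"p", "r"},
         ("p", "b", "c"): {"r"}, ("p", "b", "d"): {"q"}}
s1, s2, sr, edges = turn_based_reduction(
    {"p", "q", "r"}, {"p": [(A, B)]},
    lambda s, a, b: table.get((s, a, b), set()))
print(len(s2), len(sr), len(edges))
print(uniform_delta(("p", A, "d"), edges))
```

The size of the result is governed by the number of pairs $(A, B)$ per state, which is where the exponential dependence on the number of moves discussed in the complexity analysis arises.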

Proof (of Lemma 17). The proof of the result is essentially identical to the proof of Theorem 7, and we
present the details for completeness. Let $v_i$ = zf. We show that for all $k$-uniform memoryless strategies $\pi_1$
for player 1 we have $\langle\!\langle 1 \rangle\!\rangle^{\pi_1}_{\mathit{val}}(\mathrm{Safe}(F)) \le v_i$.

Let $\overline{\pi}_2$ be a pure memoryless optimal strategy for player 2 in $\overline{G}^k_{v_i}$ for the objective complementary
to $\mathrm{Safe}(\overline{F})$, where $(\overline{G}^k_{v_i}, \mathrm{Safe}(\overline{F})) = \mathrm{TB}(G, v_i, F, k)$. Consider a $k$-uniform memoryless strategy $\pi_1$ for
player 1, and we define a pure memoryless strategy $\pi_2$ for player 2 as follows.

1. If $\pi_1(s) \notin \mathrm{OptSel}(v_i, s, k)$, then $\pi_2(s) = b \in \Gamma_2(s)$, such that $\mathit{Pre}_{\pi_1(s), b}(v_i)(s) < v_i(s)$; (such a $b$
exists since $\pi_1(s) \notin \mathrm{OptSel}(v_i, s, k)$).

2. If $\pi_1(s) \in \mathrm{OptSel}(v_i, s, k)$, then let $A = \mathrm{Supp}(\pi_1(s))$, and consider $B$ such that $B =
\mathrm{CountOpt}(v_i, s, \pi_1(s), k)$. Then we have $\pi_2(s) = b$, such that $\overline{\pi}_2((s, A, B)) = (s, A, b)$.

Observe that by construction of $\pi_2$, for all $s \in S \setminus (W_1 \cup T)$, we have $\mathit{Pre}_{\pi_1(s), \pi_2(s)}(v_i)(s) \le v_i(s)$. We
first show that in the Markov chain obtained by fixing $\pi_1$ and $\pi_2$ in $G$, there is no closed connected recurrent
set of states $C$ such that $C \subseteq S \setminus (W_1 \cup T)$. Assume towards contradiction that $C$ is a closed connected
recurrent set of states in $S \setminus (W_1 \cup T)$. The following case analysis achieves the contradiction.

1. Suppose for every state $s \in C$ we have $\pi_1(s) \in \mathrm{OptSel}(v_i, s, k)$. Then consider the strategy
$\overline{\pi}_1$ in $\overline{G}^k_{v_i}$ such that for a state $s \in C$ we have $\overline{\pi}_1(s) = (s, A, B)$, where $\mathrm{Supp}(\pi_1(s)) = A$,
and $B = \mathrm{CountOpt}(v_i, s, \pi_1(s), k)$. Since $C$ is a closed connected recurrent set of states, it follows by
construction that for all states $s \in C$ in the game $\overline{G}^k_{v_i}$ we have $\Pr_s^{\overline{\pi}_1, \overline{\pi}_2}(\mathrm{Safe}(\overline{C})) = 1$, where
$\overline{C} = C \cup \{(s, A, B) \mid s \in C\} \cup \{(s, A, b) \mid s \in C\}$. It follows that for all $s \in C$ in $\overline{G}^k_{v_i}$ we
have $\Pr_s^{\overline{\pi}_1, \overline{\pi}_2}(\mathrm{Safe}(\overline{F})) = 1$. Since $\overline{\pi}_2$ is an optimal strategy, it follows that $C \subseteq (A^k_i \cap S) \setminus W_1$. This
contradicts that $(A^k_i \cap S) \setminus W_1 = \emptyset$.

2. Otherwise for some state $s^* \in C$ we have $\pi_1(s^*) \notin \mathrm{OptSel}(v_i, s^*, k)$. Let $r = \min\{q \mid
U_q(v_i) \cap C \ne \emptyset\}$, i.e., $r$ is the least value-class with non-empty intersection with $C$. Hence it
follows that for all $q < r$, we have $U_q(v_i) \cap C = \emptyset$. Observe that since for all $s \in C$ we have
$\mathit{Pre}_{\pi_1(s), \pi_2(s)}(v_i)(s) \le v_i(s)$, it follows that for all $s \in U_r(v_i)$ either (a) $\mathrm{Dest}(s, \pi_1(s), \pi_2(s)) \subseteq
U_r(v_i)$; or (b) $\mathrm{Dest}(s, \pi_1(s), \pi_2(s)) \cap U_q(v_i) \ne \emptyset$, for some $q < r$. Since $U_r(v_i)$ is the least value-class
with non-empty intersection with $C$, it follows that for all $s \in U_r(v_i)$ we have $\mathrm{Dest}(s, \pi_1(s), \pi_2(s)) \subseteq
U_r(v_i)$. It follows that $C \subseteq U_r(v_i)$. Consider the state $s^* \in C$ such that $\pi_1(s^*) \notin \mathrm{OptSel}(v_i, s^*, k)$.
By the construction of $\pi_2(s^*)$, we have $\mathit{Pre}_{\pi_1(s^*), \pi_2(s^*)}(v_i)(s^*) < v_i(s^*)$. Hence we must have
$\mathrm{Dest}(s^*, \pi_1(s^*), \pi_2(s^*)) \cap U_q(v_i) \ne \emptyset$, for some $q < r$. Thus we have a contradiction.

It follows from the above that there is no closed connected recurrent set of states in $S \setminus (W_1 \cup T)$, and hence
with probability 1 the game reaches $W_1 \cup T$ from all states in $S \setminus (W_1 \cup T)$. Hence the probability to
satisfy $\mathrm{Safe}(F)$ is equal to the probability to reach $W_1$. Since for all states $s \in S \setminus (W_1 \cup T)$ we have
$\mathit{Pre}_{\pi_1(s), \pi_2(s)}(v_i)(s) \le v_i(s)$, it follows that given the strategies $\pi_1$ and $\pi_2$, the valuation $v_i$ satisfies all the
inequalities of the linear program to reach $W_1$. It follows that the probability to reach $W_1$ from $s$ is at most
$v_i(s)$. It follows that for all $s \in S \setminus (W_1 \cup T)$ we have $\langle\!\langle 1 \rangle\!\rangle^{\pi_1}_{\mathit{val}}(\mathrm{Safe}(F))(s) \le v_i(s)$. This completes the
proof. ■